Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread Michael Armbrust
+1 (binding)

On Mon, Jun 8, 2020 at 1:22 PM DB Tsai  wrote:

> +1 (binding)
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Mon, Jun 8, 2020 at 1:03 PM Dongjoon Hyun 
> wrote:
> >
> > +1
> >
> > Thanks,
> > Dongjoon.
> >
> > On Mon, Jun 8, 2020 at 6:37 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
> >>
> >> +1 (non-binding) ran the new SCC DSV2 suite and all other tests, no
> issues
> >>
> >> On Sun, Jun 7, 2020 at 11:12 PM Yin Huai  wrote:
> >>>
> >>> Hello everyone,
> >>>
> >>> I am wondering if it makes more sense to not count Saturday and
> Sunday. I doubt that any serious testing work was done during this past
> weekend. Can we only count business days in the voting process?
> >>>
> >>> Thanks,
> >>>
> >>> Yin
> >>>
> >>> On Sun, Jun 7, 2020 at 3:24 PM Denny Lee 
> wrote:
> 
>  +1 (non-binding)
> 
>  On Sun, Jun 7, 2020 at 3:21 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >
> > I'm seeing the effort of including the correctness issue SPARK-28067
> [1] in 3.0.0 via SPARK-31894 [2]. That doesn't seem to be a regression, so
> technically it doesn't block the release. It would still be good to weigh its
> worth (it requires some SS users to discard their state, which is less
> frightening to require in a major version upgrade), but including
> SPARK-28067 in 3.0.0 looks optional.
> >
> > Besides, I see all blockers look to be resolved, thanks all for the
> amazing efforts!
> >
> > +1 (non-binding) if the decision of SPARK-28067 is "later".
> >
> > 1. https://issues.apache.org/jira/browse/SPARK-28067
> > 2. https://issues.apache.org/jira/browse/SPARK-31894
> >
> > On Mon, Jun 8, 2020 at 5:23 AM Matei Zaharia <
> matei.zaha...@gmail.com> wrote:
> >>
> >> +1
> >>
> >> Matei
> >>
> >> On Jun 7, 2020, at 6:53 AM, Maxim Gekk 
> wrote:
> >>
> >> +1 (non-binding)
> >>
> >> On Sun, Jun 7, 2020 at 2:34 PM Takeshi Yamamuro <
> linguin@gmail.com> wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>> I don't see any ongoing PR to fix critical bugs in my area.
> >>> Bests,
> >>> Takeshi
> >>>
> >>> On Sun, Jun 7, 2020 at 7:24 PM Mridul Muralidharan <
> mri...@gmail.com> wrote:
> 
>  +1
> 
>  Regards,
>  Mridul
> 
>  On Sat, Jun 6, 2020 at 1:20 PM Reynold Xin 
> wrote:
> >
> > Apologies for the mistake. The vote is open till 11:59pm Pacific
> time on Mon June 9th.
> >
> > On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin 
> wrote:
> >>
> >> Please vote on releasing the following candidate as Apache
> Spark version 3.0.0.
> >>
> >> The vote is open until [DUE DAY] and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 3.0.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see
> http://spark.apache.org/
> >>
> >> The tag to be voted on is v3.0.0-rc3 (commit
> 3fdfce3120f307147244e5eaf46d61419a723d50):
> >> https://github.com/apache/spark/tree/v3.0.0-rc3
> >>
> >> The release files, including signatures, digests, etc. can be
> found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >>
> https://repository.apache.org/content/repositories/orgapachespark-1350/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
> >>
> >> The list of bug fixes going into 3.0.0 can be found at the
> following URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12339177
> >>
> >> This release is using the release script of the tag v3.0.0-rc3.
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by
> taking
> >> an existing Spark workload and running on this release
> candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks; in Java/Scala you
> >> can add the staging repository to your project's resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out-of-date RC going forward).
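As a minimal sketch of the "add the staging repository to your project's
resolvers" step above, assuming an sbt-based build (the Scala version and
dependency shown are illustrative, not prescriptive):

  // build.sbt
  scalaVersion := "2.12.10"

  // Resolve the 3.0.0 RC3 artifacts from the staging repository listed in this thread.
  resolvers += "Apache Spark 3.0.0 RC3 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1350/"

  // %% appends the Scala binary version, keeping artifacts consistent with scalaVersion.
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"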

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Michael Armbrust
>
> What I'd oppose is to just ban char for the native data sources, and do
> not have a plan to address this problem systematically.
>

+1


> Just forget about padding, like what Snowflake and MySQL have done.
> Document that char(x) is just an alias for string. And then move on. Almost
> no work needs to be done...
>

+1


Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-11 Thread Michael Armbrust
Thank you for the discussion everyone! This vote passes. I'll work to get
this posted on the website.

+1
Michael Armbrust
Sean Owen
Jules Damji
大啊
Ismaël Mejía
Wenchen Fan
Matei Zaharia
Gengliang Wang
Takeshi Yamamuro
Denny Lee
Xiao Li
Xingbo Jiang
Takuya UESHIN
Michael Heuer
John Zhuge
Reynold Xin
Burak Yavuz
Holden Karau
Dongjoon Hyun

To respond to some of the questions on the interpretation of this policy:


> Also, can we expand on 'when' an API change can occur ?  Since we are
> proposing to diverge from semver.
> Patch release ? Minor release ? Only major release ? Based on 'impact' of
> API ? Stability guarantees ?


This is an addition to the existing semver policy. We still do not break
stable APIs at major versions.

This new policy has a good intention, but can we narrow down on the
> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?


I do not think that we should apply this policy to the 3.0 release any
differently than we will for future releases. There is nothing special
about 3.0 that means unnecessary breakages will not be costly to our users.

If I had to summarize the policy in one sentence it would be "Think about
users before you break APIs!". As I mentioned in my original email, I think
in many cases this did not happen in the lead up to this release. Rather,
the reasoning in some cases was that "This is a major release, so we can
break things".

Given that we all agree major versions are not sufficient justification to
break APIs, I think it's reasonable to revisit and discuss, on a case-by-case
basis, some of the more commonly used broken APIs in the context of this
rubric.

We had better be more careful when we add a new policy and should aim not
> to mislead the users and 3rd party library developers to say "older is
> better".


Nothing in the policy says "older is better". It only says that age is one
factor to consider when trying to reason about usage. If an API has been
around for a long time, it's possible (but not always true) that it will
have more usage than a newer API. If usage is low, and the cost to keep it
is high, get rid of it even if it's very old.

The policy also explicitly calls out updating docs to recommend the
new/"correct" way of doing things. If you can convince all the users of
Spark to switch, then you can remove any API you want in the future :)

Is this only applying to stable apis?


This is not explicitly called out, but I would argue you should still think
about users, even when breaking experimental APIs. The bar is certainly
lower here, since we explicitly called out that these APIs might change. That
said, I would still go through the exercise and decide if the benefits
outweigh the costs before doing it. (Very similar to the discussion before
our 1.0, before any promises of stability had been made).

the way I read this proposal isn't really saying we can't break api's on
> major releases, its just saying spend more time making sure its worth it.


I agree with this interpretation!

Michael

On Tue, Mar 10, 2020 at 10:59 AM Tom Graves  wrote:

> Overall makes sense to me, but have same questions as others on the thread.
>
> Is this only applying to stable apis?
> How are we going to apply to 3.0?
>
> the way I read this proposal isn't really saying we can't break api's on
> major releases, its just saying spend more time making sure its worth it.
> Tom
>
> On Friday, March 6, 2020, 08:59:03 PM CST, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>
> I propose to add the following text to Spark's Semantic Versioning policy
> <https://spark.apache.org/versioning-policy.html> and adopt it as the
> rubric that should be used when deciding to break APIs (even at major
> versions such as 3.0).
>
>
> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a 
> procedural
> vote <https://www.apache.org/foundation/voting.html>, the measure will
> pass if there are more favourable votes than unfavourable ones. PMC votes
> are binding, but the community is encouraged to add their voice to the
> discussion.
>
>
> [ ] +1 - Spark should adopt this policy.
>
> [ ] -1  - Spark should not adopt this policy.
>
>
> 
>
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions. While this is not always possible, the
> balance of the following factors should be considered before choosing to
> break an API.
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
>
>-
>
> 

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Michael Armbrust
I'll start off the vote with a strong +1 (binding).

On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust 
wrote:

> I propose to add the following text to Spark's Semantic Versioning policy
> <https://spark.apache.org/versioning-policy.html> and adopt it as the
> rubric that should be used when deciding to break APIs (even at major
> versions such as 3.0).
>
>
> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a 
> procedural
> vote <https://www.apache.org/foundation/voting.html>, the measure will
> pass if there are more favourable votes than unfavourable ones. PMC votes
> are binding, but the community is encouraged to add their voice to the
> discussion.
>
>
> [ ] +1 - Spark should adopt this policy.
>
> [ ] -1  - Spark should not adopt this policy.
>
>
> 
>
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions. While this is not always possible, the
> balance of the following factors should be considered before choosing to
> break an API.
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
>
>-
>
>Usage - an API that is actively used in many different places, is
>always very costly to break. While it is hard to know usage for sure, there
>are a bunch of ways that we can estimate:
>-
>
>   How long has the API been in Spark?
>   -
>
>   Is the API common even for basic programs?
>   -
>
>   How often do we see recent questions in JIRA or mailing lists?
>   -
>
>   How often does it appear in StackOverflow or blogs?
>   -
>
>Behavior after the break - How will a program that works today, work
>after the break? The following are listed roughly in order of increasing
>severity:
>-
>
>   Will there be a compiler or linker error?
>   -
>
>   Will there be a runtime exception?
>   -
>
>   Will that exception happen after significant processing has been
>   done?
>   -
>
>   Will we silently return different answers? (very hard to debug,
>   might not even notice!)
>
>
> Cost of Maintaining an API
>
> Of course, the above does not mean that we will never break any APIs. We
> must also consider the cost both to the project and to our users of keeping
> the API in question.
>
>-
>
>Project Costs - Every API we have needs to be tested and needs to keep
>    working as other parts of the project change. These costs are
>significantly exacerbated when external dependencies change (the JVM,
>Scala, etc). In some cases, while not completely technically infeasible,
>the cost of maintaining a particular API can become too high.
>-
>
>User Costs - APIs also have a cognitive cost to users learning Spark
>or trying to understand Spark programs. This cost becomes even higher when
>the API in question has confusing or undefined semantics.
>
>
> Alternatives to Breaking an API
>
> In cases where there is a "Bad API", but where the cost of removal is also
> high, there are alternatives that should be considered that do not hurt
> existing users but do address some of the maintenance costs.
>
>
>-
>
>Avoid Bad APIs - While this is a bit obvious, it is an important
>point. Anytime we are adding a new interface to Spark we should consider
>that we might be stuck with this API forever. Think deeply about how
>new APIs relate to existing ones, as well as how you expect them to evolve
>over time.
>-
>
>Deprecation Warnings - All deprecation warnings should point to a
>clear alternative and should never just say that an API is deprecated.
>-
>
>Updated Docs - Documentation should point to the "best" recommended
>way of performing a given task. In the cases where we maintain legacy
>documentation, we should clearly point to newer APIs and suggest to users
>the "right" way.
>-
>
>Community Work - Many people learn Spark by reading blogs and other
>sites such as StackOverflow. However, many of these resources are out of
>date. Update them, to reduce the cost of eventually removing deprecated
>APIs.
>
>
> 
>


[VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Michael Armbrust
I propose to add the following text to Spark's Semantic Versioning policy
<https://spark.apache.org/versioning-policy.html> and adopt it as the
rubric that should be used when deciding to break APIs (even at major
versions such as 3.0).


I'll leave the vote open until Tuesday, March 10th at 2pm. As this is
a procedural
vote <https://www.apache.org/foundation/voting.html>, the measure will pass
if there are more favourable votes than unfavourable ones. PMC votes are
binding, but the community is encouraged to add their voice to the
discussion.


[ ] +1 - Spark should adopt this policy.

[ ] -1  - Spark should not adopt this policy.





Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing
behavior, even at major versions. While this is not always possible, the
balance of the following factors should be considered before choosing to
break an API.

Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark.
A broken API means that Spark programs need to be rewritten before they can
be upgraded. However, there are a few considerations when thinking about
what the cost will be:

   -

   Usage - an API that is actively used in many different places, is always
   very costly to break. While it is hard to know usage for sure, there are a
   bunch of ways that we can estimate:
   -

  How long has the API been in Spark?
  -

  Is the API common even for basic programs?
  -

  How often do we see recent questions in JIRA or mailing lists?
  -

  How often does it appear in StackOverflow or blogs?
  -

   Behavior after the break - How will a program that works today, work
   after the break? The following are listed roughly in order of increasing
   severity:
   -

  Will there be a compiler or linker error?
  -

  Will there be a runtime exception?
  -

  Will that exception happen after significant processing has been done?
  -

  Will we silently return different answers? (very hard to debug, might
  not even notice!)


Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We
must also consider the cost both to the project and to our users of keeping
the API in question.

   -

   Project Costs - Every API we have needs to be tested and needs to keep
   working as other parts of the project change. These costs are
   significantly exacerbated when external dependencies change (the JVM,
   Scala, etc). In some cases, while not completely technically infeasible,
   the cost of maintaining a particular API can become too high.
   -

   User Costs - APIs also have a cognitive cost to users learning Spark or
   trying to understand Spark programs. This cost becomes even higher when the
   API in question has confusing or undefined semantics.


Alternatives to Breaking an API

In cases where there is a "Bad API", but where the cost of removal is also
high, there are alternatives that should be considered that do not hurt
existing users but do address some of the maintenance costs.


   -

   Avoid Bad APIs - While this is a bit obvious, it is an important point.
   Anytime we are adding a new interface to Spark we should consider that we
   might be stuck with this API forever. Think deeply about how new APIs
   relate to existing ones, as well as how you expect them to evolve over time.
   -

   Deprecation Warnings - All deprecation warnings should point to a clear
   alternative and should never just say that an API is deprecated.
   -

   Updated Docs - Documentation should point to the "best" recommended way
   of performing a given task. In the cases where we maintain legacy
   documentation, we should clearly point to newer APIs and suggest to users
   the "right" way.
   -

   Community Work - Many people learn Spark by reading blogs and other
   sites such as StackOverflow. However, many of these resources are out of
   date. Update them, to reduce the cost of eventually removing deprecated
   APIs.
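To make the "Deprecation Warnings" guideline above concrete, here is a minimal
Scala sketch; the method names are hypothetical and exist only to illustrate
pointing callers at a clear alternative:

  object ExampleApi {
    // Hypothetical legacy method kept for compatibility; the message names the
    // replacement and the version in which the deprecation started.
    @deprecated("Use readParquet(path) instead; loadParquet will be removed in a future release.", "3.0.0")
    def loadParquet(path: String): Unit = readParquet(path)

    // The recommended alternative that docs and examples should point to.
    def readParquet(path: String): Unit = {
      // ... real implementation would go here ...
    }
  }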





Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Michael Armbrust
Thanks for the discussion! A few responses:

The decision needs to happen at api/config change time, otherwise the
> deprecated warning has no purpose if we are never going to remove them.
>

Even if we never remove an API, I think deprecation warnings (when done
right) can still serve a purpose. For new users, a deprecation can serve as
a pointer to newer, faster APIs or ones with less sharp edges. I would be
supportive of efforts that use them to clean up the docs. For example, we
could hide deprecated APIs after some time so they don't clutter scala/java
doc. We can and should audit things like the user guide and our own
examples to make sure they don't use deprecated APIs.


> That said we still need to be able to remove deprecated things and change
> APIs in major releases, otherwise why do a  major release in the first
> place.  Is it purely to support newer Scala/python/java versions.
>

I don't think major versions are purely for
Scala/Java/Python/Hive/Metastore upgrades, but they are a good chance to move
the project forward. Spark 3.0 has a lot of upgrades here, and I think we made
the right trade-offs, even though there are some API breaks.

Major versions are also a good time to land major changes (e.g., in 2.0 we
released whole-stage code gen).


> I think the hardest part listed here is what the impact is.  Who's call is
> that, it's hard to know how everyone is using things and I think it's been
> harder to get feedback on SPIPs and API changes in general as people are
> busy with other things.
>

This is the hardest part, and we won't always get it right. I think that
having the rubric though will help guide the conversation and help
reviewers ask the right questions.

One other thing I'll add is, sometimes the users come to us and we should
listen! I was very surprised by the response to Karen's email on this list
last week. An actual user was giving us feedback on the impact of the
changes in Spark 3.0, and rather than listening, there was a lot of pushback.
Users are never wrong when they are telling you what matters to them!


> Like you mention, I think stackoverflow is unreliable, the posts could be
> many years old and no longer relevant.
>

While this is unfortunate, I think the more we can do to keep these
answers relevant (either by updating them or by not breaking them), the better
for the health of the Spark community.


Re: Clarification on the commit protocol

2020-02-27 Thread Michael Armbrust
No, it is not. However, the commit protocol has mostly been superseded by Delta
Lake, which is available as a separate open-source project that works natively
with Apache Spark. In contrast to the commit protocol, Delta can guarantee full
ACID (rather than just partition-level atomicity). It also has better
performance in many cases, as it reduces the amount of metadata that needs to
be retrieved from the storage system.
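For readers who have not used it, a minimal sketch of what Delta Lake looks
like from Spark, assuming the open-source delta-core package is on the
classpath; the path is illustrative:

  // Write a small table in Delta format; the write is atomic at the table
  // level, not just per partition, as described above.
  val df = spark.range(0, 100).toDF("id")
  df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

  // Read it back; Delta resolves the current snapshot from its transaction log
  // rather than listing every file in the storage system.
  val events = spark.read.format("delta").load("/tmp/delta/events")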

On Wed, Feb 26, 2020 at 9:36 PM rahul c  wrote:

> Hi team,
>
> Just wanted to understand.
> Is DBIO commit protocol available in open source spark version ?
>


[Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-24 Thread Michael Armbrust
Hello Everyone,

As more users have started upgrading to Spark 3.0 preview (including
myself), there have been many discussions around APIs that have been broken
compared with Spark 2.x. In many of these discussions, one of the
rationales for breaking an API seems to be "Spark follows semantic
versioning , so this major
release is our chance to get it right [by breaking APIs]". Similarly, in
many cases the response to questions about why an API was completely
removed has been, "this API has been deprecated since x.x, so we have to
remove it".

As a long time contributor to and user of Spark this interpretation of the
policy is concerning to me. This reasoning misses the intention of the
original policy, and I am worried that it will hurt the long-term success
of the project.

I definitely understand that these are hard decisions, and I'm not
proposing that we never remove anything from Spark. However, I would like
to give some additional context and also propose a different rubric for
thinking about API breakage moving forward.

Spark adopted semantic versioning back in 2014 during the preparations for
the 1.0 release. As this was the first major release -- and as, up until
fairly recently, Spark had only been an academic project -- no real
promises had been made about API stability ever.

During the discussion, some committers suggested that this was an
opportunity to clean up cruft and give the Spark APIs a once-over, making
cosmetic changes to improve consistency. However, in the end, it was
decided that in many cases it was not in the best interests of the Spark
community to break things just because we could. Matei actually said it
pretty forcefully:

I know that some names are suboptimal, but I absolutely detest breaking
APIs, config names, etc. I’ve seen it happen way too often in other
projects (even things we depend on that are officially post-1.0, like Akka
or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
cutting-edge users are okay with libraries occasionally changing, but many
others will consider it a show-stopper. Given this, I think that any
cosmetic change now, even though it might improve clarity slightly, is not
worth the tradeoff in terms of creating an update barrier for existing
users.

In the end, while some changes were made, most APIs remained the same and
users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think
this served the project very well, as compatibility means users are able to
upgrade and we keep as many people on the latest versions of Spark (though
maybe not the latest APIs of Spark) as possible.

As Spark grows, I think compatibility actually becomes more important and
we should be more conservative rather than less. Today, there are very
likely more Spark programs running than there were at any other time in the
past. Spark is no longer a tool only used by advanced hackers, it is now
also running "traditional enterprise workloads.'' In many cases these jobs
are powering important processes long after the original author leaves.

Broken APIs can also affect libraries that extend Spark. This dependency
can be even harder on users: if a library they need has not been upgraded to
use the new APIs, they are stuck.

Given all of this, I'd like to propose the following rubric as an addition
to our semantic versioning policy. After discussion and if people agree
this is a good idea, I'll call a vote of the PMC to ratify its inclusion in
the official policy.

Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing
behavior, even at major versions. While this is not always possible, the
balance of the following factors should be considered before choosing to
break an API.

Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark.
A broken API means that Spark programs need to be rewritten before they can
be upgraded. However, there are a few considerations when thinking about
what the cost will be:

   -

   Usage - an API that is actively used in many different places, is always
   very costly to break. While it is hard to know usage for sure, there are a
   bunch of ways that we can estimate:
   -

  How long has the API been in Spark?
  -

  Is the API common even for basic programs?
  -

  How often do we see recent questions in JIRA or mailing lists?
  -

  How often does it appear in StackOverflow or blogs?
  -

   Behavior after the break - How will a program that works today, work
   after the break? The following are listed roughly in order of increasing
   severity:
   -

  Will there be a compiler or linker error?
  -

  Will there be a runtime exception?
  -

  Will that exception happen after significant 

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-21 Thread Michael Armbrust
This plan for evolving the TRIM function to be more standards compliant
sounds much better to me than the original change to just switch the order.
It pushes users in the right direction and cleans up our tech debt without
silently breaking existing workloads. It means that programs won't return
different results when run on Spark 2.x and Spark 3.x.

One caveat:

> If we keep this situation in 3.0.0 release (a major release), it means
>>> Apache Spark will be forever
>>>
>>
I think this ship has already sailed. There is nothing special about 3.0
here. If the API is in a released version of Spark, then the mistake is
already made.

Major releases are an opportunity to break APIs when we *have to*. We
always strive to avoid breaking APIs even if they have not been in an X.0
release.
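For reference, the standards-compliant syntax under discussion can be exercised
through spark.sql; a small sketch that deliberately avoids the two-argument
form whose parameter order was at issue (the expected results follow standard
TRIM semantics):

  // Standard SQL TRIM syntax: unambiguous about which argument is the trim
  // character and which is the source string.
  spark.sql("SELECT trim(BOTH 'x' FROM 'xxSparkxx')").show()      // Spark
  spark.sql("SELECT trim(LEADING 'x' FROM 'xxSparkxx')").show()   // Sparkxx
  spark.sql("SELECT trim(TRAILING 'x' FROM 'xxSparkxx')").show()  // xxSpark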


Re: [VOTE] Release Apache Spark 2.4.2

2019-04-19 Thread Michael Armbrust
+1 (binding), we've tested this and it LGTM.

On Thu, Apr 18, 2019 at 7:51 PM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.2.
>
> The vote is open until April 23 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.2-rc1 (commit
> a44880ba74caab7a987128cb09c4bee41617770a):
> https://github.com/apache/spark/tree/v2.4.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1322/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/
>
> The list of bug fixes going into 2.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12344996
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.2?
> ===
>
> The current list of open tickets targeted at 2.4.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
Thanks Ryan. To me the "test" for putting things in a maintenance release
is really a trade-off between benefit and risk (along with some caveats,
like user facing surface should not grow). The benefits here are fairly
large (now it is possible to plug in partition aware data sources) and the
risk is very low (no change in behavior by default).

And bugs aren't usually fixed with a configuration flag to turn on the fix.


Agree, this should be on by default in master. That would just tip the risk
balance for me in a maintenance release.

On Tue, Apr 16, 2019 at 4:55 PM Ryan Blue  wrote:

> Spark has a lot of strange behaviors already that we don't fix in patch
> releases. And bugs aren't usually fixed with a configuration flag to turn
> on the fix.
>
> That said, I don't have a problem with this commit making it into a patch
> release. This is a small change and looks safe enough to me. I was just a
> little surprised since I was expecting a correctness issue if this is
> prompting a release. I'm definitely on the side of case-by-case judgments
> on what to allow in patch releases and this looks fine.
>
> On Tue, Apr 16, 2019 at 4:27 PM Michael Armbrust 
> wrote:
>
>> I would argue that its confusing enough to a user for options from
>> DataFrameWriter to be silently dropped when instantiating the data source
>> to consider this a bug.  They asked for partitioning to occur, and we are
>> doing nothing (not even telling them we can't).  I was certainly surprised
>> by this behavior.  Do you have a different proposal about how this should
>> be handled?
>>
>> On Tue, Apr 16, 2019 at 4:23 PM Ryan Blue  wrote:
>>
>>> Is this a bug fix? It looks like a new feature to me.
>>>
>>> On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust 
>>> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I know we just released Spark 2.4.1, but in light of fixing SPARK-27453
>>>> <https://issues.apache.org/jira/browse/SPARK-27453> I was wondering if
>>>> it might make sense to follow up quickly with 2.4.2.  Without this fix its
>>>> very hard to build a datasource that correctly handles partitioning without
>>>> using unstable APIs.  There are also a few other fixes that have trickled
>>>> in since 2.4.1.
>>>>
>>>> If there are no objections, I'd like to start the process shortly.
>>>>
>>>> Michael
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
I would argue that it's confusing enough to a user for options from
DataFrameWriter to be silently dropped when instantiating the data source
to consider this a bug.  They asked for partitioning to occur, and we are
doing nothing (not even telling them we can't).  I was certainly surprised
by this behavior.  Do you have a different proposal about how this should
be handled?
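For context, a sketch of the call pattern being discussed; the data source
name is hypothetical:

  import org.apache.spark.sql.functions.current_date

  val df = spark.range(0, 10).withColumn("date", current_date())

  // The user explicitly asks for partitioning here. The problem described
  // above is that this request could be silently dropped when the source was
  // instantiated, with no error or warning reported back.
  df.write
    .format("com.example.datasource")   // hypothetical source
    .partitionBy("date")
    .save("/tmp/output")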

On Tue, Apr 16, 2019 at 4:23 PM Ryan Blue  wrote:

> Is this a bug fix? It looks like a new feature to me.
>
> On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust 
> wrote:
>
>> Hello All,
>>
>> I know we just released Spark 2.4.1, but in light of fixing SPARK-27453
>> <https://issues.apache.org/jira/browse/SPARK-27453> I was wondering if
>> it might make sense to follow up quickly with 2.4.2.  Without this fix its
>> very hard to build a datasource that correctly handles partitioning without
>> using unstable APIs.  There are also a few other fixes that have trickled
>> in since 2.4.1.
>>
>> If there are no objections, I'd like to start the process shortly.
>>
>> Michael
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Spark 2.4.2

2019-04-16 Thread Michael Armbrust
Hello All,

I know we just released Spark 2.4.1, but in light of fixing SPARK-27453
<https://issues.apache.org/jira/browse/SPARK-27453> I was wondering if it
might make sense to follow up quickly with 2.4.2.  Without this fix it's
very hard to build a datasource that correctly handles partitioning without
using unstable APIs.  There are also a few other fixes that have trickled
in since 2.4.1.

If there are no objections, I'd like to start the process shortly.

Michael


Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Michael Armbrust
>
> Agree. Just curious, could you explain what do you mean by "negation"?
> Does it mean applying retraction on aggregated?
>

Yeah exactly.  Our current streaming aggregation assumes that the input is
in append-mode and multiple aggregations break this.


Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Michael Armbrust
Thanks for bringing up some possible future directions for streaming. Here
are some thoughts:
 - I personally view all of the activity on Spark SQL also as activity on
Structured Streaming. The great thing about building streaming on catalyst
/ tungsten is that continued improvement to these components improves
streaming use cases as well.
 - I think the biggest on-going project is DataSourceV2, whose goal is to
provide a stable / performant API for streaming and batch data sources to
plug in.  I think connectivity to many different systems is one of the most
powerful aspects of Spark and right now there is no stable public API for
streaming. A lot of committer / PMC time is being spent here at the moment.
 - As you mention, 2.4.0 significantly improves the built in connectivity
for Kafka, giving us the ability to read exactly once from a topic being
written to transactional producers. I think projects to extend this
guarantee to the Kafka Sink and also to improve authentication with Kafka
are a great idea (and it seems like there is a lot of review activity on
the latter).

You bring up some other possible projects like session window support.
This is an interesting project, but as far as I can tell there is still a lot
of work that would need to be done before this feature could be merged.  We'd
need to understand how it works with update mode, amongst other things.
Additionally, a 3000+ line patch is really time-consuming to
review. This, coupled with the fact that all the users I have
interacted with need "session windows + some custom business logic"
(usually implemented with flatMapGroupsWithState), means that I'm more
inclined to direct limited review bandwidth to incremental improvements in
that feature than to something large/new. This is not to say that this
feature isn't useful or shouldn't be merged, just a bit of explanation as to
why there might be less activity here than you would hope.

Similarly, multiple aggregations are an often requested feature.  However,
fundamentally, this is going to be a fairly large investment (I think we'd
need to combine the unsupported operation checker and the query planner, and
also create a high-performance (i.e. whole-stage code-generated) aggregation
operator that understands negation).

Thanks again for starting the discussion, and looking forward to hearing
about what features are most requested!

On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim  wrote:

> Adding more: again, it doesn't mean they're feasible to do. Just a kind of
> brainstorming.
>
> * SPARK-20568: Delete files after processing in structured streaming
>   * There hasn't been consensus regarding supporting this: there were
> voices for both YES and NO.
> * Support multiple levels of aggregations in structured streaming
>   * There're plenty of questions in SO regarding this. While I don't think
> it makes sense on structured streaming if it requires additional shuffle,
> there might be another case: group by keys, apply aggregation, apply
> aggregation on aggregated result (grouped keys don't change)
>
> On Mon, Oct 22, 2018 at 12:25 PM, Jungtaek Lim wrote:
>
>> Yeah, the main intention of this thread is to collect interest on
>> possible feature list for structured streaming. From what I can see in
>> Spark community, most of the discussions as well as contributions are for
>> SQL, and I'd wish to see similar activeness / efforts on structured
>> streaming.
>> (Unfortunately there's less effort to review others' works - design doc
>> as well as pull request - most of efforts looks like being spent to their
>> own works.)
>>
>> I respect the role of PMC member, so the final decision would be up to
>> PMC members, but contributors as well as end users could show the interest
>> as well as discuss about requirements on SPIP, which could be a good
>> background to persuade PMC members.
>>
>> Before going into the deep I guess we could use this thread to discuss
>> about possible use cases, and if we would like to move forward to
>> individual thread we could initiate (or resurrect) its discussion thread.
>>
>> For queryable state, at least there seems no workaround in Spark to
>> provide similar thing, especially state is getting bigger. I may have some
>> concerns on the details, but I'll add my thought on the discussion thread.
>>
>> - Jungtaek Lim (HeartSaVioR)
>>
>> On Mon, Oct 22, 2018 at 1:15 AM, Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> I just tried to start the discussion in the dev list a long time ago.
>>> I enumerated some use cases as Michael proposed here.
>>> The discussion didn't go further.
>>>
>>> If people find it useful we should start discussing it in detail again.
>>>
>>> Stavros
>>>
>>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim  wrote:
>>>
 Stavros, if my memory is right, you were 

Re: Sorting on a streaming dataframe

2018-04-30 Thread Michael Armbrust
Please open a JIRA then!

On Fri, Apr 27, 2018 at 3:59 AM Hemant Bhanawat <hemant9...@gmail.com>
wrote:

> I see.
>
> monotonically_increasing_id on streaming dataFrames will be really helpful
> to me and I believe to many more users. Adding this functionality in Spark
> would be efficient in terms of performance as compared to implementing this
> functionality inside the applications.
>
> Hemant
>
> On Thu, Apr 26, 2018 at 11:59 PM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> The basic tenet of structured streaming is that a query should return the
>> same answer in streaming or batch mode. We support sorting in complete mode
>> because we have all the data and can sort it correctly and return the full
>> answer.  In update or append mode, sorting would only return a correct
>> answer if we could promise that records that sort lower are going to arrive
>> later (and we can't).  Therefore, it is disallowed.
>>
>> If you are just looking for a unique, stable id and you are already using
>> kafka as the source, you could just combine the partition id and the
>> offset. The structured streaming connector to Kafka
>> <https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html>
>> exposes both of these in the schema of the streaming DataFrame. (similarly
>> for kinesis you can use the shard id and sequence number)
>>
>> If you need the IDs to be contiguous, then this is a somewhat
>> fundamentally hard problem.  I think the best we could do is add support
>> for monotonically_increasing_id() in streaming dataframes.
>>
>> On Tue, Apr 24, 2018 at 1:38 PM, Chayapan Khannabha <chaya...@gmail.com>
>> wrote:
>>
>>> Perhaps your use case fits to Apache Kafka better.
>>>
>>> More info at:
>>> https://kafka.apache.org/documentation/streams/
>>>
>>> Everything really comes down to the architecture design and algorithm
>>> spec. However, from my experience with Spark, there are many good reasons
>>> why this requirement is not supported ;)
>>>
>>> Best,
>>>
>>> Chayapan (A)
>>>
>>>
>>> On Apr 24, 2018, at 2:18 PM, Hemant Bhanawat <hemant9...@gmail.com>
>>> wrote:
>>>
>>> Thanks Chris. There are many ways in which I can solve this problem but
>>> they are cumbersome. The easiest way would have been to sort the streaming
>>> dataframe. The reason I asked this question is because I could not find a
>>> reason why sorting on streaming dataframe is disallowed.
>>>
>>> Hemant
>>>
>>> On Mon, Apr 16, 2018 at 6:09 PM, Bowden, Chris <
>>> chris.bow...@microfocus.com> wrote:
>>>
>>>> You can happily sort the underlying RDD of InternalRow(s) inside a
>>>> sink, assuming you are willing to implement and maintain your own sink(s).
>>>> That is, just grabbing the parquet sink, etc. isn’t going to work out of
>>>> the box. Alternatively map/flatMapGroupsWithState is probably sufficient
>>>> and requires less working knowledge to make effective reuse of internals.
>>>> Just group by foo and then sort accordingly and assign ids. The id counter
>>>> can be stateful per group. Sometimes this problem may not need to be solved
>>>> at all. For example, if you are using kafka, a proper partitioning scheme
>>>> and message offsets may be “good enough”.
>>>> --
>>>> *From:* Hemant Bhanawat <hemant9...@gmail.com>
>>>> *Sent:* Thursday, April 12, 2018 11:42:59 PM
>>>> *To:* Reynold Xin
>>>> *Cc:* dev
>>>> *Subject:* Re: Sorting on a streaming dataframe
>>>>
>>>> Well, we want to assign snapshot ids (incrementing counters) to the
>>>> incoming records. For that, we are zipping the streaming rdds with that
>>>> counter using a modified version of ZippedWithIndexRDD. We are ok if the
>>>> records in the streaming dataframe gets counters in random order but the
>>>> counter should always be incrementing.
>>>>
>>>> This is working fine until we have a failure. When we have a failure,
>>>> we re-assign the records to snapshot ids  and this time same snapshot id
>>>> can get assigned to a different record. This is a problem because the
>>>> primary key in our storage engine is <recordid, snapshotid>. So we want to
>>>> sort the dataframe so that the records always get the same snapshot id.
>>>>
>>>>
>>>>
>>>> On Fri, Apr 13, 2018 at 11:43 AM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>> Can you describe your use case more?
>>>>
>>>> On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat <hemant9...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> Why is sorting on streaming dataframes not supported(unless it is
>>>> complete mode)? My downstream needs me to sort the streaming dataframe.
>>>>
>>>> Hemant
>>>>
>>>>
>>>>
>>>
>>>
>>
>


Re: Sorting on a streaming dataframe

2018-04-26 Thread Michael Armbrust
The basic tenet of structured streaming is that a query should return the
same answer in streaming or batch mode. We support sorting in complete mode
because we have all the data and can sort it correctly and return the full
answer.  In update or append mode, sorting would only return a correct
answer if we could promise that records that sort lower are going to arrive
later (and we can't).  Therefore, it is disallowed.

If you are just looking for a unique, stable id and you are already using
kafka as the source, you could just combine the partition id and the
offset. The structured streaming connector to Kafka
<https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html>
exposes both of these in the schema of the streaming DataFrame. (similarly
for kinesis you can use the shard id and sequence number)
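A minimal sketch of the partition-id-plus-offset approach described above; the
broker and topic names are placeholders:

  import org.apache.spark.sql.functions._

  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")   // placeholder broker
    .option("subscribe", "events")                     // placeholder topic
    .load()

  // The Kafka source schema includes `partition` and `offset`; combined they
  // form an id that is unique and stable across restarts, though not contiguous.
  val withId = stream.withColumn(
    "uid",
    concat_ws("-", col("partition").cast("string"), col("offset").cast("string")))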

If you need the IDs to be contiguous, then this is a somewhat fundamentally
hard problem.  I think the best we could do is add support
for monotonically_increasing_id() in streaming dataframes.

On Tue, Apr 24, 2018 at 1:38 PM, Chayapan Khannabha 
wrote:

> Perhaps your use case fits to Apache Kafka better.
>
> More info at:
> https://kafka.apache.org/documentation/streams/
>
> Everything really comes down to the architecture design and algorithm
> spec. However, from my experience with Spark, there are many good reasons
> why this requirement is not supported ;)
>
> Best,
>
> Chayapan (A)
>
>
> On Apr 24, 2018, at 2:18 PM, Hemant Bhanawat  wrote:
>
> Thanks Chris. There are many ways in which I can solve this problem but
> they are cumbersome. The easiest way would have been to sort the streaming
> dataframe. The reason I asked this question is because I could not find a
> reason why sorting on streaming dataframe is disallowed.
>
> Hemant
>
> On Mon, Apr 16, 2018 at 6:09 PM, Bowden, Chris <
> chris.bow...@microfocus.com> wrote:
>
>> You can happily sort the underlying RDD of InternalRow(s) inside a sink,
>> assuming you are willing to implement and maintain your own sink(s). That
>> is, just grabbing the parquet sink, etc. isn’t going to work out of the
>> box. Alternatively map/flatMapGroupsWithState is probably sufficient and
>> requires less working knowledge to make effective reuse of internals. Just
>> group by foo and then sort accordingly and assign ids. The id counter can
>> be stateful per group. Sometimes this problem may not need to be solved at
>> all. For example, if you are using kafka, a proper partitioning scheme and
>> message offsets may be “good enough”.
>> --
>> *From:* Hemant Bhanawat 
>> *Sent:* Thursday, April 12, 2018 11:42:59 PM
>> *To:* Reynold Xin
>> *Cc:* dev
>> *Subject:* Re: Sorting on a streaming dataframe
>>
>> Well, we want to assign snapshot ids (incrementing counters) to the
>> incoming records. For that, we are zipping the streaming rdds with that
>> counter using a modified version of ZippedWithIndexRDD. We are ok if the
>> records in the streaming dataframe gets counters in random order but the
>> counter should always be incrementing.
>>
>> This is working fine until we have a failure. When we have a failure, we
>> re-assign the records to snapshot ids  and this time same snapshot id can
>> get assigned to a different record. This is a problem because the primary
>> key in our storage engine is <recordid, snapshotid>. So we want to sort the
>> dataframe so that the records always get the same snapshot id.
>>
>>
>>
>> On Fri, Apr 13, 2018 at 11:43 AM, Reynold Xin 
>> wrote:
>>
>> Can you describe your use case more?
>>
>> On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat 
>> wrote:
>>
>> Hi Guys,
>>
>> Why is sorting on streaming dataframes not supported(unless it is
>> complete mode)? My downstream needs me to sort the streaming dataframe.
>>
>> Hemant
>>
>>
>>
>
>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-26 Thread Michael Armbrust
+1 all our pipelines have been running the RC for several days now.

On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun 
wrote:

> +1 (non-binding).
>
> Bests,
> Dongjoon.
>
>
>
> On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue 
> wrote:
>
>> +1 (non-binding)
>>
>> On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li  wrote:
>>
>>> +1 (binding) in Spark SQL, Core and PySpark.
>>>
>>> Xiao
>>>
>>> 2018-02-24 14:49 GMT-08:00 Ricardo Almeida >> >:
>>>
 +1 (non-binding)

 same as previous RC

 On 24 February 2018 at 11:10, Hyukjin Kwon  wrote:

> +1
>
> 2018-02-24 16:57 GMT+09:00 Bryan Cutler :
>
>> +1
>> Tests passed and additionally ran Arrow related tests and did some
>> perf checks with python 2.7.14
>>
>> On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau 
>> wrote:
>>
>>> Note: given the state of Jenkins I'd love to see Bryan Cutler or
>>> someone with Arrow experience sign off on this release.
>>>
>>> On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian 
>>> wrote:
>>>
 +1 (binding)

 Passed all the tests, looks good.

 Cheng

 On 2/23/18 15:00, Holden Karau wrote:

 +1 (binding)
 PySpark artifacts install in a fresh Py3 virtual env

 On Feb 23, 2018 7:55 AM, "Denny Lee"  wrote:

> +1 (non-binding)
>
> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
> joshgoldsboroughs...@gmail.com> wrote:
>
>> New to testing out Spark RCs for the community but I was able to
>> run some of the basic unit tests without error so for what it's 
>> worth, I'm
>> a +1.
>>
>> On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal <
>> samee...@apache.org> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.3.0. The vote is open until Tuesday February 27, 2018 at 
>>> 8:00:00
>>> am UTC and passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.0-rc5:
>>> https://github.com/apache/spark/tree/v2.3.0-rc5
>>> (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>>>
>>> List of JIRA tickets resolved in this release can be found here:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>>
>>> The release files, including signatures, digests, etc. can be
>>> found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapache
>>> spark-1266/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs
>>> /_site/index.html
>>>
>>>
>>> FAQ
>>>
>>> ===
>>> What are the unresolved issues targeted for 2.3.0?
>>> ===
>>>
>>> Please see https://s.apache.org/oXKi. At the time of writing,
>>> there are currently no known release blockers.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by
>>> taking an existing Spark workload and running on this release 
>>> candidate,
>>> then reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and
>>> install the current RC and see if anything important breaks, in the
>>> Java/Scala you can add the staging repository to your project's resolvers
>>> and test with the RC (make sure to clean up the artifact cache before/after
>>> so you don't end up building with an out-of-date RC going forward).
>>> before/after
>>> so you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.0?
>>> ===
>>>

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Michael Armbrust
I'm -1 on any changes that aren't fixing major regressions from 2.2 at this
point. Also, in any case where it's possible, we should be flipping new
features off if they are still regressing, rather than continuing to
attempt to fix them.

Since its experimental, I would support backporting the DataSourceV2
patches into 2.3.1 so that there is more opportunity for feedback as the
API matures.

On Wed, Feb 21, 2018 at 11:32 AM, Shixiong(Ryan) Zhu <
shixi...@databricks.com> wrote:

> FYI. I found two more blockers:
>
> https://issues.apache.org/jira/browse/SPARK-23475
> https://issues.apache.org/jira/browse/SPARK-23481
>
> On Wed, Feb 21, 2018 at 9:45 AM, Xiao Li  wrote:
>
>> Hi, Ryan,
>>
>> In this release, Data Source V2 is experimental. We are still collecting
>> the feedbacks from the community and will improve the related APIs and
>> implementation in the next 2.4 release.
>>
>> Thanks,
>>
>> Xiao
>>
>> 2018-02-21 9:43 GMT-08:00 Xiao Li :
>>
>>> Hi, Justin,
>>>
>>> Based on my understanding, SPARK-17147 is also not a regression. Thus,
>>> Spark 2.3.0 is unable to contain it. We have to wait for the committers who
>>> are familiar with Spark Streaming to make a decision whether we can fix the
>>> issue in Spark 2.3.1.
>>>
>>> Since this is open source, feel free to add the patch in your local
>>> build.
>>>
>>> Thanks for using Spark!
>>>
>>> Xiao
>>>
>>>
>>> 2018-02-21 9:36 GMT-08:00 Ryan Blue :
>>>
 No problem if we can't add them, this is experimental anyway so this
 release should be more about validating the API and the start of our
 implementation. I just don't think we can recommend that anyone actually
 use DataSourceV2 without these patches.

 On Wed, Feb 21, 2018 at 9:21 AM, Wenchen Fan 
 wrote:

> SPARK-23323 adds a new API, I'm not sure we can still do it at this
> stage of the release... Besides users can work around it by calling the
> spark output coordinator themselves in their data source.
>
> SPARK-23203 is non-trivial and didn't fix any known bugs, so it's hard
> to convince other people that it's safe to add it to the release during 
> the
> RC phase.
>
> SPARK-23418 depends on the above one.
>
> Generally they are good to have in Spark 2.3, if they were merged
> before the RC. I think this is a lesson we should learn from, that we
> should work on stuff we want in the release before the RC, instead of 
> after.
>
> On Thu, Feb 22, 2018 at 1:01 AM, Ryan Blue 
> wrote:
>
>> What does everyone think about getting some of the newer DataSourceV2
>> improvements in? It should be low risk because it is a new code path, and
>> v2 isn't very usable without things like support for using the output
>> commit coordinator to deconflict writes.
>>
>> The ones I'd like to get in are:
>> * Use the output commit coordinator: https://issues.ap
>> ache.org/jira/browse/SPARK-23323
>> * Use immutable trees and the same push-down logic as other read
>> paths: https://issues.apache.org/jira/browse/SPARK-23203
>> * Don't allow users to supply schemas when they aren't supported:
>> https://issues.apache.org/jira/browse/SPARK-23418
>>
>> I think it would make the 2.3.0 release more usable for anyone
>> interested in the v2 read and write paths.
>>
>> Thanks!
>>
>> On Tue, Feb 20, 2018 at 7:07 PM, Weichen Xu <
>> weichen...@databricks.com> wrote:
>>
>>> +1
>>>
>>> On Wed, Feb 21, 2018 at 10:07 AM, Marcelo Vanzin <
>>> van...@cloudera.com> wrote:
>>>
 Done, thanks!

 On Tue, Feb 20, 2018 at 6:05 PM, Sameer Agarwal <
 samee...@apache.org> wrote:
 > Sure, please feel free to backport.
 >
 > On 20 February 2018 at 18:02, Marcelo Vanzin 
 wrote:
 >>
 >> Hey Sameer,
 >>
 >> Mind including https://github.com/apache/spark/pull/20643
 >> (SPARK-23468)  in the new RC? It's a minor bug since I've only
 hit it
 >> with older shuffle services, but it's pretty safe.
 >>
 >> On Tue, Feb 20, 2018 at 5:58 PM, Sameer Agarwal <
 samee...@apache.org>
 >> wrote:
 >> > This RC has failed due to
 >> > https://issues.apache.org/jira/browse/SPARK-23470.
 >> > Now that the fix has been merged in 2.3 (thanks Marcelo!),
 I'll follow
 >> > up
 >> > with an RC5 soon.
 >> >
 >> > On 20 February 2018 at 16:49, Ryan Blue 
 wrote:
 >> >>
 >> >> +1
 >> >>
 >> >> Build & tests look fine, checked signature and checksums for
 src
 >> >> tarball.
 >> >>
 >> >> On Tue, Feb 20, 2018 at 12:54 PM, 

Re: DataSourceV2: support for named tables

2018-02-02 Thread Michael Armbrust
I am definitely in favor of first-class / consistent support for tables and
data sources.

One thing that is not clear to me from this proposal is exactly what the
interfaces are between:
 - Spark
 - A (The?) metastore
 - A data source

If we pass in the table identifier, is the data source then responsible for
talking directly to the metastore? Is that what we want? (I'm not sure.)

On Fri, Feb 2, 2018 at 10:39 AM, Ryan Blue 
wrote:

> There are two main ways to load tables in Spark: by name (db.table) and by
> a path. Unfortunately, the integration for DataSourceV2 has no support for
> identifying tables by name.
>
> I propose supporting the use of TableIdentifier, which is the standard
> way to pass around table names.
>
> The reason I think we should do this is to easily support more ways of
> working with DataSourceV2 tables. SQL statements and parts of the
> DataFrameReader and DataFrameWriter APIs that use table names create
> UnresolvedRelation instances that wrap an unresolved TableIdentifier.
>
> By adding support for passing TableIdentifier to a DataSourceV2Relation,
> then about all we need to enable these code paths is to add a resolution
> rule. For that rule, we could easily identify a default data source that
> handles named tables.
>
> This is what we’re doing in our Spark build, and we have DataSourceV2
> tables working great through SQL. (Part of this depends on the logical plan
> changes from my previous email to ensure inserts are properly resolved.)
>
> In the long term, I think we should update how we parse tables so that
> TableIdentifier can contain a source in addition to a database/context
> and a table name. That would allow us to integrate new sources fairly
> seamlessly, without needing a rather redundant SQL create statement like
> this:
>
> CREATE TABLE database.name USING source OPTIONS (table 'database.name')
>
> Also, I think we should pass TableIdentifier to DataSourceV2Relation,
> rather than going with Wenchen’s suggestion that we pass the table name as
> a string property, “table”. My rationale is that the new API shouldn’t leak
> its internal details to other parts of the planner.
>
> If we were to convert TableIdentifier to a “table” property wherever
> DataSourceV2Relation is created, we create several places that need to be
> in sync with the same convention. On the other hand, passing
> TableIdentifier to DataSourceV2Relation and relying on the relation to
> correctly set the options passed to readers and writers minimizes the
> number of places that conversion needs to happen.
>
> rb
> ​
> --
> Ryan Blue
> Software Engineer
> Netflix
>
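As a rough, self-contained illustration of the trade-off discussed above, here is a toy model; the class and method names are invented for the sketch and are not Spark's actual internals:

// Toy model: the relation keeps the structured identifier and is the single
// place that turns it into reader/writer options.
case class TableIdent(table: String, database: Option[String] = None) {
  def qualified: String = database.map(_ + ".").getOrElse("") + table
}

case class V2RelationSketch(
    source: String,
    table: Option[TableIdent],
    options: Map[String, String]) {

  // One conversion point instead of every call site re-encoding a "table" property.
  def readerOptions: Map[String, String] =
    options ++ table.map(t => "table" -> t.qualified)
}

// A resolution-rule-like step: an unresolved name is routed to a default v2 source.
def resolveNamedTable(ident: TableIdent, defaultSource: String): V2RelationSketch =
  V2RelationSketch(defaultSource, Some(ident), Map.empty)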


Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-02 Thread Michael Armbrust
>
> So here are my recommendations for moving forward, with DataSourceV2 as a
> starting point:
>
>1. Use well-defined logical plan nodes for all high-level operations:
>insert, create, CTAS, overwrite table, etc.
>2. Use rules that match on these high-level plan nodes, so that it
>isn’t necessary to create rules to match each eventual code path
>individually
>3. Define Spark’s behavior for these logical plan nodes. Physical
>nodes should implement that behavior, but all CREATE TABLE OVERWRITE should
>(eventually) make the same guarantees.
>4. Specialize implementation when creating a physical plan, not
>logical plans.
>
> I realize this is really long, but I’d like to hear thoughts about this.
> I’m sure I’ve left out some additional context, but I think the main idea
> here is solid: let's standardize logical plans for more consistent behavior
> and easier maintenance.
>
Context aside, I really like these rules! I think having query planning be
the boundary for specialization makes a lot of sense.

(RunnableCommand might also be my fault though sorry! :P)
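A toy sketch of what recommendations 1 and 2 could look like; the node and rule names are invented for illustration and are not Spark's real classes:

// Toy stand-ins for "well-defined logical plan nodes" for the high-level operations.
sealed trait LogicalWrite
final case class NamedTable(db: String, name: String)
final case class CreateTableAsSelect(table: NamedTable, ifNotExists: Boolean) extends LogicalWrite
final case class OverwriteTable(table: NamedTable) extends LogicalWrite
final case class AppendData(table: NamedTable) extends LogicalWrite

// One rule matches the high-level node, so each eventual code path (SQL,
// DataFrameWriter, ...) does not need its own rule; sources specialize only
// when the physical plan is created.
def planWrite(plan: LogicalWrite): String = plan match {
  case CreateTableAsSelect(t, ifNotExists) =>
    s"physical CTAS for ${t.db}.${t.name} (ifNotExists=$ifNotExists)"
  case OverwriteTable(t) => s"physical overwrite of ${t.db}.${t.name}"
  case AppendData(t)     => s"physical append to ${t.db}.${t.name}"
}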


Re: Max number of streams supported ?

2018-01-31 Thread Michael Armbrust
-dev +user


> Similarly for structured streaming, Would there be any limit on number of
> of streaming sources I can have ?
>

There is no fundamental limit, but each stream will have a thread on the
driver that is doing coordination of execution.  We comfortably run 20+
streams on a single cluster in production, but I have not pushed the
limits.  You'd want to test with your specific application.
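A minimal sketch of running several queries from one session; the rate source and console sink here are stand-ins for real sources and sinks:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("many-streams").getOrCreate()

// Each started query gets its own coordination thread on the driver.
val queries = (1 to 20).map { i =>
  spark.readStream.format("rate").option("rowsPerSecond", "1").load()
    .writeStream
    .queryName(s"stream-$i")
    .format("console")
    .start()
}

// Block until any of the running queries terminates.
spark.streams.awaitAnyTermination()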


Re: Spark error while trying to spark.read.json()

2017-12-19 Thread Michael Armbrust
- dev

java.lang.AbstractMethodError almost always means that you have different
libraries on the classpath than at compilation time.  In this case I would
check to make sure you have the correct version of Scala (and only one
version of Scala) on the classpath.
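One quick sanity check is to print the versions that are actually on the runtime classpath and compare them with what the job was compiled against, for example:

// Versions actually loaded at runtime; a mismatch with the compile-time
// versions is the usual cause of java.lang.AbstractMethodError.
println(scala.util.Properties.versionString) // e.g. "version 2.11.8"
println(org.apache.spark.SPARK_VERSION)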

On Tue, Dec 19, 2017 at 5:42 PM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:

> Hi All,
>
> Can anyone help me with below error,
>
> Exception in thread "main" java.lang.AbstractMethodError
> at scala.collection.TraversableLike$class.filterNot(TraversableLike.
> scala:278)
> at org.apache.spark.sql.types.StructType.filterNot(StructType.scala:98)
> at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:386)
> at org.spark.jsonDF.StructStreamKafkaToDF$.getValueSchema(
> StructStreamKafkaToDF.scala:22)
> at org.spark.jsonDF.StructStreaming$.createRowDF(StructStreaming.scala:21)
> at SparkEntry$.delayedEndpoint$SparkEntry$1(SparkEntry.scala:22)
> at SparkEntry$delayedInit$body.apply(SparkEntry.scala:7)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> at scala.runtime.AbstractFunction0.apply$mcV$
> sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.generic.TraversableForwarder$class.
> foreach(TraversableForwarder.scala:35)
> at scala.App$class.main(App.scala:76)
> at SparkEntry$.main(SparkEntry.scala:7)
> at SparkEntry.main(SparkEntry.scala)
>
> This happening, when i try to pass Dataset[String] containing jsons to
> spark.read.json(Records).
>
> Regards,
> Satyajit.
>


Re: Timeline for Spark 2.3

2017-12-19 Thread Michael Armbrust
Do people really need to be around for the branch cut (modulo the person
cutting the branch)?

1st or 2nd doesn't really matter to me, but I am +1 kicking this off as
soon as we enter the new year :)

Michael

On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau  wrote:

> Sounds reasonable, although I'd choose the 2nd perhaps just since lots of
> folks are off on the 1st?
>
> On Tue, Dec 19, 2017 at 4:36 PM, Sameer Agarwal 
> wrote:
>
>> Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that
>> (i.e., week of 8th Jan)?
>>
>>
>> On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau 
>> wrote:
>>
>>> So personally I’d be in favour or pushing to early January, doing a
>>> release over the holidays is a little rough with herding all of people to
>>> vote.
>>>
>>> On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson 
>>> wrote:
>>>
 I wanted to check in on the state of the 2.3 freeze schedule.  Original
 proposal was "late Dec", which is a bit open to interpretation.

 We are working to get some refactoring done on the integration testing
 for the Kubernetes back-end in preparation for testing upcoming release
 candidates, however holiday vacation time is about to begin taking its toll
 both on upstream reviewing and on the "downstream" spark-on-kube fork.

 If the freeze pushed into January, that would take some of the pressure
 off the kube back-end upstreaming. However, regardless, I was wondering if
 the dates could be clarified.
 Cheers,
 Erik


 On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com 
 wrote:

> Hi,
>
> What is the process to request an issue/fix to be included in the next
> release? Is there a place to vote for features?
> I am interested in https://issues.apache.org/jira/browse/SPARK-13127,
> to see
> if we can get Spark to upgrade Parquet to 1.9.0, which addresses
> https://issues.apache.org/jira/browse/PARQUET-686.
> Can we include the fix in Spark 2.3 release?
>
> Thanks,
>
> Dong
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
 --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Sameer Agarwal
>> Software Engineer | Databricks Inc.
>> http://cs.berkeley.edu/~sameerag
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: queryable state & streaming

2017-12-08 Thread Michael Armbrust
https://issues.apache.org/jira/browse/SPARK-16738

I don't believe anyone is working on it yet.  I think the most useful thing
is to start enumerating requirements and use cases and then we can talk
about how to build it.

On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
st.kontopou...@gmail.com> wrote:

> Cool Burak do you have a pointer, should I take the initiative for a first
> design document or Databricks is working on it?
>
> Best,
> Stavros
>
> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz  wrote:
>
>> Hi Stavros,
>>
>> Queryable state is definitely on the roadmap! We will revamp the
>> StateStore API a bit, and a queryable StateStore is definitely one of the
>> things we are thinking about during that revamp.
>>
>> Best,
>> Burak
>>
>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" 
>> wrote:
>>
>>> Just to re-phrase my question: Would query-able state make a viable
>>> SPIP?
>>>
>>> Regards,
>>> Stavros
>>>
>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>> st.kontopou...@gmail.com> wrote:
>>>
 Hi,

 Maybe this has been discussed before. Given the fact that many
 streaming apps out there use state extensively, could be a good idea to
 make Spark expose streaming state with an external API like other
 systems do (Kafka streams, Flink etc), in order to facilitate
 interactive queries?

 Regards,
 Stavros

>>>
>>>
>


Timeline for Spark 2.3

2017-11-09 Thread Michael Armbrust
According to the timeline posted on the website, we are nearing branch cut
for Spark 2.3.  I'd like to propose pushing this out towards mid to late
December for a couple of reasons and would like to hear what people think.

1. I've done release management during the Thanksgiving / Christmas time
before and in my experience, we don't actually get a lot of testing during
this time due to vacations and other commitments. I think beginning the RC
process in early January would give us the best coverage in the shortest
amount of time.
2. There are several large initiatives in progress that given a little more
time would leave us with a much more exciting 2.3 release. Specifically,
the work on the history server, Kubernetes and continuous processing.
3. Given the actual release date of Spark 2.2, I think we'll still get
Spark 2.3 out roughly 6 months after.

Thoughts?

Michael


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Michael Armbrust
+1

On Sat, Nov 4, 2017 at 11:02 AM, Xiao Li  wrote:

> +1
>
> 2017-11-04 11:00 GMT-07:00 Burak Yavuz :
>
>> +1
>>
>> On Fri, Nov 3, 2017 at 10:02 PM, vaquar khan 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu 
>>> wrote:
>>>
 +1.

 On Sat, Nov 4, 2017 at 8:04 AM, Matei Zaharia 
 wrote:

> +1 from me too.
>
> Matei
>
> > On Nov 3, 2017, at 4:59 PM, Wenchen Fan  wrote:
> >
> > +1.
> >
> > I think this architecture makes a lot of sense to let executors talk
> to source/sink directly, and bring very low latency.
> >
> > On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen 
> wrote:
> > +0 simply because I don't feel I know enough to have an opinion. I
> have no reason to doubt the change though, from a skim through the doc.
> >
> >
> > On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin 
> wrote:
> > Earlier I sent out a discussion thread for CP in Structured
> Streaming:
> >
> > https://issues.apache.org/jira/browse/SPARK-20928
> >
> > It is meant to be a very small, surgical change to Structured
> Streaming to enable ultra-low latency. This is great timing because we are
> also designing and implementing data source API v2. If designed properly,
> we can have the same data source API working for both streaming and batch.
> >
> >
> > Following the SPIP process, I'm putting this SPIP up for a vote.
> >
> > +1: Let's go ahead and design / implement the SPIP.
> > +0: Don't really care.
> > -1: I do not think this is a good idea for the following reasons.
> >
> >
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>> +1 -224-436-0783 <(224)%20436-0783>
>>> Greater Chicago
>>>
>>
>>
>


Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Michael Armbrust
- dev

I think you should be able to write an Aggregator.
You probably want to run in update mode if you are looking for it to output
any group that has changed in the batch.
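A minimal sketch of that approach, standing in for the rdd.map(...).reduceByKey(merge) job described in the quoted message below; the Event type, rate source, and key/amount derivation are all assumptions:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Event(key: String, amount: Long)

// Plays the role of reduceByKey((a, b) => merge(a, b)): merge values per key.
object SumAmounts extends Aggregator[Event, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, e: Event): Long = acc + e.amount
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder.appName("aggregator-sketch").getOrCreate()
import spark.implicits._

// Rate source stands in for the real input.
val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .selectExpr("CAST(value % 10 AS STRING) AS key", "value AS amount")
  .as[Event]

val query = events
  .groupByKey(_.key)
  .agg(SumAmounts.toColumn)
  .writeStream
  .outputMode("update")   // emit only the groups that changed in each batch
  .format("console")
  .start()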

On Wed, Oct 25, 2017 at 5:52 PM, Piyush Mukati 
wrote:

> Hi,
> we are migrating some jobs from Dstream to Structured Stream.
>
> Currently to handle aggregations we call map and reducebyKey on each RDD
> like
> rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
>
> The final output of each RDD is merged to the sink with support for
> aggregation at the sink( Like co-processor at HBase ).
>
> In the new DataSet API, I am not finding any suitable API to aggregate
> over the micro-batch.
> Most of the aggregation API uses state-store and provide global
> aggregations. ( with append mode it does not give the change in existing
> buckets )
> Problems we are suspecting are :
>  1) state-store is tightly linked to the job definitions. while in our
> case we want may edit the job while keeping the older calculated aggregate
> as it is.
>
> The desired result can be achieved with below dataset APIs.
> dataset.groupByKey(a=>a._1).mapGroups( (key, valueItr) => merge(valueItr))
> while on observing the physical plan it does not call any merge before
> sort.
>
>  Anyone aware of API or other workarounds to get the desired result?
>


Re: Easy way to get offset metatada with Spark Streaming API

2017-09-14 Thread Michael Armbrust
Yep, that is correct.  You can also use the query ID which is a GUID that
is stored in the checkpoint and preserved across restarts if you want to
distinguish the batches from different streams.

sqlContext.sparkContext.getLocalProperty(StreamExecution.QUERY_ID_KEY)

This was added recently
<https://github.com/apache/spark/commit/2d968a07d211688a9c588deb859667dd8b653b27>
though.
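As a sketch of how the two pieces could fit together in a custom sink; Sink is an internal, evolving API in the 2.x line discussed here, so treat the shape as illustrative:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Sink, StreamExecution}

// (query ID, batch ID) identifies a batch even when several streams share a store.
class TaggedSink(sqlContext: SQLContext) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val queryId =
      sqlContext.sparkContext.getLocalProperty(StreamExecution.QUERY_ID_KEY)
    // A real sink would check/commit (queryId, batchId) transactionally with the data.
    println(s"query=$queryId batch=$batchId")
  }
}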

On Thu, Sep 14, 2017 at 3:40 AM, Dmitry Naumenko <dm.naume...@gmail.com>
wrote:

> Ok. So since I can get repeated batch ids, I guess I can just store the
> last committed batch id in my storage (in the same transaction with the
> data) and initialize the custom sink with right batch id when application
> re-starts. After this just ignore batch if current batchId <=
> latestBatchId.
>
> Dmitry
>
>
> 2017-09-13 22:12 GMT+03:00 Michael Armbrust <mich...@databricks.com>:
>
>> I think the right way to look at this is the batchId is just a proxy for
>> offsets that is agnostic to what type of source you are reading from (or
>> how many sources there are).  We might call into a custom sink with the
>> same batchId more than once, but it will always contain the same data
>> (there is no race condition, since this is stored in a write-ahead log).
>> As long as you check/commit the batch id in the same transaction as the
>> data you will get exactly once.
>>
>> On Wed, Sep 13, 2017 at 1:25 AM, Dmitry Naumenko <dm.naume...@gmail.com>
>> wrote:
>>
>>> Thanks, I see.
>>>
>>> However, I guess reading from checkpoint directory might be less
>>> efficient comparing just preserving offsets in Dataset.
>>>
>>> I have one more question about operation idempotence (hope it help
>>> others to have a clear picture).
>>>
>>> If I read offsets on re-start from RDBMS and manually specify starting
>>> offsets on Kafka Source, is it still possible that in case of any failure I
>>> got a situation where the duplicate batch id will go to a Custom Sink?
>>>
>>> Previously on DStream, you will just read offsets from storage on start
>>> and just write them into DB in one transaction with data and it's was
>>> enough for "exactly-once". Please, correct me if I made a mistake here. So
>>> does the same strategy will work with Structured Streaming?
>>>
>>> I guess, that in case of Structured Streaming, Spark will commit batch
>>> offset to a checkpoint directory and there can be a race condition where
>>> you can commit your data with offsets into DB, but Spark will fail to
>>> commit the batch id, and some kind of automatic retry happen. If this is
>>> true, is it possible to disable this automatic re-try, so I can still use
>>> unified API for batch/streaming with my own re-try logic (which is
>>> basically, just ignore intermediate data, re-read from Kafka and re-try
>>> processing and load)?
>>>
>>> Dmitry
>>>
>>>
>>> 2017-09-12 22:43 GMT+03:00 Michael Armbrust <mich...@databricks.com>:
>>>
>>>> In the checkpoint directory there is a file /offsets/$batchId that
>>>> holds the offsets serialized as JSON.  I would not consider this a public
>>>> stable API though.
>>>>
>>>> Really the only important thing to get exactly once is that you must
>>>> ensure whatever operation you are doing downstream is idempotent with
>>>> respect to the batchId.  For example, if you are writing to an RDBMS you
>>>> could have a table that records the batch ID and update that in the same
>>>> transaction as you append the results of the batch.  Before trying to
>>>> append you should check that batch ID and make sure you have not already
>>>> committed.
>>>>
>>>> On Tue, Sep 12, 2017 at 11:48 AM, Dmitry Naumenko <
>>>> dm.naume...@gmail.com> wrote:
>>>>
>>>>> Thanks for response, Michael
>>>>>
>>>>> >  You should still be able to get exactly once processing by using
>>>>> the batchId that is passed to the Sink.
>>>>>
>>>>> Could you explain this in more detail, please? Is there some kind of
>>>>> offset manager API that works as get-offset by batch id lookup table?
>>>>>
>>>>> Dmitry
>>>>>
>>>>> 2017-09-12 20:29 GMT+03:00 Michael Armbrust <mich...@databricks.com>:
>>>>>
>>>>>> I think that we are going to have to change the Sink API as part of
>>>>>> SPARK-20

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-13 Thread Michael Armbrust
I think the right way to look at this is that the batchId is just a proxy for
offsets that is agnostic to what type of source you are reading from (or
how many sources there are).  We might call into a custom sink with the
same batchId more than once, but it will always contain the same data
(there is no race condition, since this is stored in a write-ahead log).
As long as you check/commit the batch id in the same transaction as the
data you will get exactly once.

On Wed, Sep 13, 2017 at 1:25 AM, Dmitry Naumenko <dm.naume...@gmail.com>
wrote:

> Thanks, I see.
>
> However, I guess reading from the checkpoint directory might be less efficient
> compared to just preserving the offsets in the Dataset.
>
> I have one more question about operation idempotence (hope it help others
> to have a clear picture).
>
> If I read offsets on re-start from RDBMS and manually specify starting
> offsets on Kafka Source, is it still possible that in case of any failure I
> got a situation where the duplicate batch id will go to a Custom Sink?
>
> Previously on DStream, you will just read offsets from storage on start
> and just write them into the DB in one transaction with the data, and it was
> enough for "exactly-once". Please, correct me if I made a mistake here. So
> does the same strategy will work with Structured Streaming?
>
> I guess, that in case of Structured Streaming, Spark will commit batch
> offset to a checkpoint directory and there can be a race condition where
> you can commit your data with offsets into DB, but Spark will fail to
> commit the batch id, and some kind of automatic retry happen. If this is
> true, is it possible to disable this automatic re-try, so I can still use
> unified API for batch/streaming with my own re-try logic (which is
> basically, just ignore intermediate data, re-read from Kafka and re-try
> processing and load)?
>
> Dmitry
>
>
> 2017-09-12 22:43 GMT+03:00 Michael Armbrust <mich...@databricks.com>:
>
>> In the checkpoint directory there is a file /offsets/$batchId that holds
>> the offsets serialized as JSON.  I would not consider this a public stable
>> API though.
>>
>> Really the only important thing to get exactly once is that you must
>> ensure whatever operation you are doing downstream is idempotent with
>> respect to the batchId.  For example, if you are writing to an RDBMS you
>> could have a table that records the batch ID and update that in the same
>> transaction as you append the results of the batch.  Before trying to
>> append you should check that batch ID and make sure you have not already
>> committed.
>>
>> On Tue, Sep 12, 2017 at 11:48 AM, Dmitry Naumenko <dm.naume...@gmail.com>
>> wrote:
>>
>>> Thanks for response, Michael
>>>
>>> >  You should still be able to get exactly once processing by using the 
>>> > batchId
>>> that is passed to the Sink.
>>>
>>> Could you explain this in more detail, please? Is there some kind of
>>> offset manager API that works as get-offset by batch id lookup table?
>>>
>>> Dmitry
>>>
>>> 2017-09-12 20:29 GMT+03:00 Michael Armbrust <mich...@databricks.com>:
>>>
>>>> I think that we are going to have to change the Sink API as part of
>>>> SPARK-20928 <https://issues-test.apache.org/jira/browse/SPARK-20928>,
>>>> which is why I linked these tickets together.  I'm still targeting an
>>>> initial version for Spark 2.3 which should happen sometime towards the end
>>>> of the year.
>>>>
>>>> There are some misconceptions in that stack overflow answer that I can
>>>> correct.  Until we improve the Source API, You should still be able to get
>>>> exactly once processing by using the batchId that is passed to the Sink.
>>>> We guarantee that the offsets present at any given batch ID will be the
>>>> same across retries by recording this information in the checkpoint's WAL.
>>>> The checkpoint does not use java serialization (like DStreams does) and can
>>>> be used even after upgrading Spark.
>>>>
>>>>
>>>> On Tue, Sep 12, 2017 at 12:45 AM, Dmitry Naumenko <
>>>> dm.naume...@gmail.com> wrote:
>>>>
>>>>> Thanks, Cody
>>>>>
>>>>> Unfortunately, it seems to be there is no active development right
>>>>> now. Maybe I can step in and help with it somehow?
>>>>>
>>>>> Dmitry
>>>>>
>>>>> 2017-09-11 21:01 GMT+03:00 Cody Koeninger <c...@koeninger.org>:
>>>>>
>>>&g

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-12 Thread Michael Armbrust
In the checkpoint directory there is a file /offsets/$batchId that holds
the offsets serialized as JSON.  I would not consider this a public stable
API though.

Really the only important thing to get exactly once is that you must ensure
whatever operation you are doing downstream is idempotent with respect to
the batchId.  For example, if you are writing to an RDBMS you could have a
table that records the batch ID and update that in the same transaction as
you append the results of the batch.  Before trying to append you should
check that batch ID and make sure you have not already committed.
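A sketch of that pattern with plain JDBC; the table names, columns, and connection URL are assumptions for illustration:

import java.sql.DriverManager

def commitBatchIdempotently(jdbcUrl: String, batchId: Long, rows: Seq[(String, Long)]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)
    // Skip the whole batch if this batchId was already committed (i.e. a retry).
    val check = conn.prepareStatement("SELECT 1 FROM batch_progress WHERE batch_id = ?")
    check.setLong(1, batchId)
    if (!check.executeQuery().next()) {
      // Write the results and record the batch id in the same transaction.
      val insert = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
      rows.foreach { case (k, v) =>
        insert.setString(1, k); insert.setLong(2, v); insert.addBatch()
      }
      insert.executeBatch()
      val mark = conn.prepareStatement("INSERT INTO batch_progress (batch_id) VALUES (?)")
      mark.setLong(1, batchId)
      mark.executeUpdate()
    }
    conn.commit()
  } catch {
    case e: Throwable => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}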

On Tue, Sep 12, 2017 at 11:48 AM, Dmitry Naumenko <dm.naume...@gmail.com>
wrote:

> Thanks for response, Michael
>
> >  You should still be able to get exactly once processing by using the 
> > batchId
> that is passed to the Sink.
>
> Could you explain this in more detail, please? Is there some kind of
> offset manager API that works as get-offset by batch id lookup table?
>
> Dmitry
>
> 2017-09-12 20:29 GMT+03:00 Michael Armbrust <mich...@databricks.com>:
>
>> I think that we are going to have to change the Sink API as part of
>> SPARK-20928 <https://issues-test.apache.org/jira/browse/SPARK-20928>,
>> which is why I linked these tickets together.  I'm still targeting an
>> initial version for Spark 2.3 which should happen sometime towards the end
>> of the year.
>>
>> There are some misconceptions in that stack overflow answer that I can
>> correct.  Until we improve the Source API, You should still be able to get
>> exactly once processing by using the batchId that is passed to the Sink.
>> We guarantee that the offsets present at any given batch ID will be the
>> same across retries by recording this information in the checkpoint's WAL.
>> The checkpoint does not use java serialization (like DStreams does) and can
>> be used even after upgrading Spark.
>>
>>
>> On Tue, Sep 12, 2017 at 12:45 AM, Dmitry Naumenko <dm.naume...@gmail.com>
>> wrote:
>>
>>> Thanks, Cody
>>>
>>> Unfortunately, it seems there is no active development right now.
>>> Maybe I can step in and help with it somehow?
>>>
>>> Dmitry
>>>
>>> 2017-09-11 21:01 GMT+03:00 Cody Koeninger <c...@koeninger.org>:
>>>
>>>> https://issues-test.apache.org/jira/browse/SPARK-18258
>>>>
>>>> On Mon, Sep 11, 2017 at 7:15 AM, Dmitry Naumenko <dm.naume...@gmail.com>
>>>> wrote:
>>>> > Hi all,
>>>> >
>>>> > It started as a discussion in
>>>> > https://stackoverflow.com/questions/46153105/how-to-get-kafka-offsets-with-spark-structured-streaming-api.
>>>> >
>>>> > So the problem that there is no support in Public API to obtain the
>>>> Kafka
>>>> > (or Kineses) offsets. For example, if you want to save offsets in
>>>> external
>>>> > storage in Custom Sink, you should :
>>>> > 1) preserve topic, partition and offset across all transform
>>>> operations of
>>>> > Dataset (based on hard-coded Kafka schema)
>>>> > 2) make a manual group by partition/offset with aggregate max offset
>>>> >
>>>> > Structured Streaming doc says "Every streaming source is assumed to
>>>> have
>>>> > offsets", so why it's not a part of Public API? What do you think
>>>> about
>>>> > supporting it?
>>>> >
>>>> > Dmitry
>>>>
>>>
>>>
>>
>


Re: Easy way to get offset metatada with Spark Streaming API

2017-09-12 Thread Michael Armbrust
I think that we are going to have to change the Sink API as part of
SPARK-20928, which
is why I linked these tickets together.  I'm still targeting an initial
version for Spark 2.3 which should happen sometime towards the end of the
year.

There are some misconceptions in that stack overflow answer that I can
correct.  Until we improve the Source API, you should still be able to get
exactly once processing by using the batchId that is passed to the Sink. We
guarantee that the offsets present at any given batch ID will be the same
across retries by recording this information in the checkpoint's WAL. The
checkpoint does not use java serialization (like DStreams does) and can be
used even after upgrading Spark.


On Tue, Sep 12, 2017 at 12:45 AM, Dmitry Naumenko 
wrote:

> Thanks, Cody
>
> Unfortunately, it seems there is no active development right now.
> Maybe I can step in and help with it somehow?
>
> Dmitry
>
> 2017-09-11 21:01 GMT+03:00 Cody Koeninger :
>
>> https://issues-test.apache.org/jira/browse/SPARK-18258
>>
>> On Mon, Sep 11, 2017 at 7:15 AM, Dmitry Naumenko 
>> wrote:
>> > Hi all,
>> >
>> > It started as a discussion in
>> > https://stackoverflow.com/questions/46153105/how-to-get-kafka-offsets-with-spark-structured-streaming-api.
>> >
>> > So the problem that there is no support in Public API to obtain the
>> Kafka
>> > (or Kineses) offsets. For example, if you want to save offsets in
>> external
>> > storage in Custom Sink, you should :
>> > 1) preserve topic, partition and offset across all transform operations
>> of
>> > Dataset (based on hard-coded Kafka schema)
>> > 2) make a manual group by partition/offset with aggregate max offset
>> >
>> > Structured Streaming doc says "Every streaming source is assumed to have
>> > offsets", so why it's not a part of Public API? What do you think about
>> > supporting it?
>> >
>> > Dmitry
>>
>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Michael Armbrust
+1

On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue  wrote:

> +1 (non-binding)
>
> Thanks for making the updates reflected in the current PR. It would be
> great to see the doc updated before it is finally published though.
>
> Right now it feels like this SPIP is focused more on getting the basics
> right for what many datasources are already doing in API V1 combined with
> other private APIs, vs pushing forward state of the art for performance.
>
> I think that’s the right approach for this SPIP. We can add the support
> you’re talking about later with a more specific plan that doesn’t block
> fixing the problems that this addresses.
> ​
>
> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> +1 (binding)
>>
>> I personally believe that there is quite a big difference between having
>> a generic data source interface with a low surface area and pushing down a
>> significant part of query processing into a datasource. The later has much
>> wider wider surface area and will require us to stabilize most of the
>> internal catalyst API's which will be a significant burden on the community
>> to maintain and has the potential to slow development velocity
>> significantly. If you want to write such integrations then you should be
>> prepared to work with catalyst internals and own up to the fact that things
>> might change across minor versions (and in some cases even maintenance
>> releases). If you are willing to go down that road, then your best bet is
>> to use the already existing spark session extensions which will allow you
>> to write such integrations and can be used as an `escape hatch`.
>>
>>
>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash  wrote:
>>
>>> +0 (non-binding)
>>>
>>> I think there are benefits to unifying all the Spark-internal
>>> datasources into a common public API for sure.  It will serve as a forcing
>>> function to ensure that those internal datasources aren't advantaged vs
>>> datasources developed externally as plugins to Spark, and that all Spark
>>> features are available to all datasources.
>>>
>>> But I also think this read-path proposal avoids the more difficult
>>> questions around how to continue pushing datasource performance forwards.
>>> James Baker (my colleague) had a number of questions about advanced
>>> pushdowns (combined sorting and filtering), and Reynold also noted that
>>> pushdown of aggregates and joins are desirable on longer timeframes as
>>> well.  The Spark community saw similar requests, for aggregate pushdown in
>>> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>>> in SPARK-12449.  Clearly a number of people are interested in this kind of
>>> performance work for datasources.
>>>
>>> To leave enough space for datasource developers to continue
>>> experimenting with advanced interactions between Spark and their
>>> datasources, I'd propose we leave some sort of escape valve that enables
>>> these datasources to keep pushing the boundaries without forking Spark.
>>> Possibly that looks like an additional unsupported/unstable interface that
>>> pushes down an entire (unstable API) logical plan, which is expected to
>>> break API on every release.   (Spark attempts this full-plan pushdown, and
>>> if that fails Spark ignores it and continues on with the rest of the V2 API
>>> for compatibility).  Or maybe it looks like something else that we don't
>>> know of yet.  Possibly this falls outside of the desired goals for the V2
>>> API and instead should be a separate SPIP.
>>>
>>> If we had a plan for this kind of escape valve for advanced datasource
>>> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
>>> focused more on getting the basics right for what many datasources are
>>> already doing in API V1 combined with other private APIs, vs pushing
>>> forward state of the art for performance.
>>>
>>> Andrew
>>>
>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>>> suresh.thalam...@gmail.com> wrote:
>>>
 +1 (non-binding)


 On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:

 Hi all,

 In the previous discussion, we decided to split the read and write path
 of data source v2 into 2 SPIPs, and I'm sending this email to call a vote
 for Data Source V2 read path only.

 The full document of the Data Source API V2 is:
 https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit

 The ready-for-review PR that implements the basic infrastructure for
 the read path is:
 https://github.com/apache/spark/pull/19136

 The vote will be up for the next 72 hours. Please reply with your vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following
 technical reasons.

 Thanks!



Re: Increase Timeout or optimize Spark UT?

2017-08-23 Thread Michael Armbrust
I think we already set the number of partitions to 5 in tests?
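For reference, a sketch of pinning the same setting when building a test session (the value 5 mirrors what the built-in SQL test harness is said to use above):

import org.apache.spark.sql.SparkSession

// A small shuffle partition count keeps local test aggregations/joins from
// scheduling 200 near-empty tasks each.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-tests")
  .config("spark.sql.shuffle.partitions", "5")
  .getOrCreate()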

On Tue, Aug 22, 2017 at 3:25 PM, Maciej Szymkiewicz 
wrote:

> Hi,
>
> From my experience it is possible to cut quite a lot by reducing
> spark.sql.shuffle.partitions to some reasonable value (let's say
> comparable to the number of cores). 200 is a serious overkill for most of
> the test cases anyway.
>
>
> Best,
> Maciej
>
>
>
> On 21 August 2017 at 03:00, Dong Joon Hyun  wrote:
>
>> +1 for any efforts to recover Jenkins!
>>
>>
>>
>> Thank you for the direction.
>>
>>
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>>
>> *From: *Reynold Xin 
>> *Date: *Sunday, August 20, 2017 at 5:53 PM
>> *To: *Dong Joon Hyun 
>> *Cc: *"dev@spark.apache.org" 
>> *Subject: *Re: Increase Timeout or optimize Spark UT?
>>
>>
>>
>> It seems like it's time to look into how to cut down some of the test
>> runtimes. Test runtimes will slowly go up given the way development
>> happens. 3 hr is already a very long time for tests to run.
>>
>>
>>
>>
>>
>> On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun 
>> wrote:
>>
>> Hi, All.
>>
>>
>>
>> Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6) has
>> been hitting the build timeout.
>>
>>
>>
>> Please see the build time trend.
>>
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend
>>
>>
>>
>> All recent 22 builds fail due to timeout directly/indirectly. The last
>> success (SBT with Hadoop-2.7) is 15th August.
>>
>>
>>
>> We may do the followings.
>>
>>
>>
>>1. Increase Build Timeout (3 hr 30 min)
>>2. Optimize UTs (Scala/Java/Python/UT)
>>
>>
>>
>> But, Option 1 will be the immediate solution for now . Could you update
>> the Jenkins setup?
>>
>>
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>>
>
>


Re: [SS] watermark, eventTime and "StreamExecution: Streaming query made progress"

2017-08-11 Thread Michael Armbrust
The point here is to tell you what watermark value was used when executing
this batch.  You don't know the new watermark until the batch is over and
we don't want to do two passes over the data.  In general the semantics of
the watermark are designed to be conservative (i.e. just because data is
older than the watermark does not mean it will be dropped, but data will
never be dropped until after it is below the watermark).
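A small sketch of the API being discussed; the rate source, column rename, and window size are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()
import spark.implicits._

val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withColumnRenamed("timestamp", "eventTime")

val counts = events
  .withWatermark("eventTime", "10 minutes")  // state may be dropped only for data older than this
  .groupBy(window($"eventTime", "5 minutes"))
  .count()

// The progress report shows the watermark that was in effect while the batch ran,
// which is why it appears to lag one trigger behind the observed event times.
val query = counts.writeStream.outputMode("update").format("console").start()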

On Fri, Aug 11, 2017 at 12:23 AM, Jacek Laskowski  wrote:

> Hi,
>
> I'm curious why watermark is updated the next streaming batch after
> it's been observed [1]? The report (from
> ProgressReporter/StreamExecution) does not look right to me as
> avg/max/min are already calculated according to the watermark [2]
>
> My recommendation would be to do the update [2] in the same streaming
> batch it was observed. Why not? Please enlighten.
>
> 17/08/11 09:04:20 INFO StreamExecution: Streaming query made progress: {
>   "id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
>   "runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
>   "name" : "rates-to-console",
>   "timestamp" : "2017-08-11T07:04:20.004Z",
>   "batchId" : 1,
>   "numInputRows" : 2,
>   "inputRowsPerSecond" : 0.7601672367920943,
>   "processedRowsPerSecond" : 25.31645569620253,
>   "durationMs" : {
> "addBatch" : 48,
> "getBatch" : 6,
> "getOffset" : 0,
> "queryPlanning" : 1,
> "triggerExecution" : 79,
> "walCommit" : 23
>   },
>   "eventTime" : {
> "avg" : "2017-08-11T07:04:17.782Z",
> "max" : "2017-08-11T07:04:18.282Z",
> "min" : "2017-08-11T07:04:17.282Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   },
>
> ...
>
> 17/08/11 09:04:30 INFO StreamExecution: Streaming query made progress: {
>   "id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
>   "runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
>   "name" : "rates-to-console",
>   "timestamp" : "2017-08-11T07:04:30.003Z",
>   "batchId" : 2,
>   "numInputRows" : 10,
>   "inputRowsPerSecond" : 1.000100010001,
>   "processedRowsPerSecond" : 56.17977528089888,
>   "durationMs" : {
> "addBatch" : 147,
> "getBatch" : 6,
> "getOffset" : 0,
> "queryPlanning" : 1,
> "triggerExecution" : 178,
> "walCommit" : 22
>   },
>   "eventTime" : {
> "avg" : "2017-08-11T07:04:23.782Z",
> "max" : "2017-08-11T07:04:28.282Z",
> "min" : "2017-08-11T07:04:19.282Z",
> "watermark" : "2017-08-11T07:04:08.282Z"
>   },
>
> [1] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/execution/streaming/
> StreamExecution.scala?utf8=%E2%9C%93#L538
> [2] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/execution/streaming/
> ProgressReporter.scala#L257
>
> Regards,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Thoughts on release cadence?

2017-07-31 Thread Michael Armbrust
+1, should we update https://spark.apache.org/versioning-policy.html ?

On Sun, Jul 30, 2017 at 3:34 PM, Reynold Xin  wrote:

> This is reasonable ... +1
>
>
> On Sun, Jul 30, 2017 at 2:19 AM, Sean Owen  wrote:
>
>> The project had traditionally posted some guidance about upcoming
>> releases. The last release cycle was about 6 months. What about penciling
>> in December 2017 for 2.3.0? http://spark.apache.org
>> /versioning-policy.html
>>
>
>


[ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Michael Armbrust
Hi all,

Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release
removes the experimental tag from Structured Streaming. In addition, this
release focuses on usability, stability, and polish, resolving over 1100
tickets.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 2.2.0, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes: https://spark.apache.org/releases/spark-release-2-2-0.html

*(note: If you see any issues with the release notes, webpage or published
artifacts, please contact me directly off-list) *

Michael


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-07 Thread Michael Armbrust
This vote passes! I'll followup with the release on Monday.

+1:
Michael Armbrust (binding)
Kazuaki Ishizaki
Sean Owen (binding)
Joseph Bradley (binding)
Ricardo Almeida
Herman van Hövell tot Westerflier (binding)
Yanbo Liang
Nick Pentreath (binding)
Wenchen Fan (binding)
Sameer Agarwal
Denny Lee
Felix Cheung
Holden Karau
Dong Joon Hyun
Reynold Xin (binding)
Hyukjin Kwon
Yin Huai (binding)
Xiao Li

-1: None

On Fri, Jul 7, 2017 at 12:21 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> +1
>
> Xiao Li
>
> 2017-07-06 22:18 GMT-07:00 Yin Huai <yh...@databricks.com>:
>
>> +1
>>
>> On Thu, Jul 6, 2017 at 8:40 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> +1
>>>
>>> 2017-07-07 6:41 GMT+09:00 Reynold Xin <r...@databricks.com>:
>>>
>>>> +1
>>>>
>>>>
>>>> On Fri, Jun 30, 2017 at 6:44 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.2.0-rc6
>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc6> (a2c7b2133cfee7f
>>>>> a9abfaa2bfbfb637155466783)
>>>>>
>>>>> List of JIRA tickets resolved can be found with this filter
>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>>>>> .
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1245/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>>
>>>>> *But my bug isn't fixed!??!*
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>
>>>>
>>>>
>>>
>>
>


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-06-30 Thread Michael Armbrust
I'll kick off the vote with a +1.

On Fri, Jun 30, 2017 at 6:44 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc6
> <https://github.com/apache/spark/tree/v2.2.0-rc6> (a2c7b2133cfee7f
> a9abfaa2bfbfb637155466783)
>
> List of JIRA tickets resolved can be found with this filter
> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
> .
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1245/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


[VOTE] Apache Spark 2.2.0 (RC6)

2017-06-30 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.2.0-rc6 (a2c7b2133cfee7fa9abfaa2bfbfb637155466783)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1245/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-26 Thread Michael Armbrust
Okay, this vote fails.  Following with RC6 shortly.

On Wed, Jun 21, 2017 at 12:51 PM, Imran Rashid <iras...@cloudera.com> wrote:

> -1
>
> I'm sorry for discovering this so late, but I just filed
> https://issues.apache.org/jira/browse/SPARK-21165 which I think should be
> a blocker, its a regression from 2.1
>
> On Wed, Jun 21, 2017 at 1:43 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> As before, release looks good, all Scala, Python tests pass. R tests fail
>> with same issue in SPARK-21093 but it's not a blocker.
>>
>> +1 (binding)
>>
>>
>> On Wed, 21 Jun 2017 at 01:49 Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> I will kick off the voting with a +1.
>>>
>>> On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00
>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>>
>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.2.0-rc5
>>>> <https://github.com/apache/spark/tree/v2.2.0-rc5> (62e442e73a2fa66
>>>> 3892d2edaff5f7d72d7f402ed)
>>>>
>>>> List of JIRA tickets resolved can be found with this filter
>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>>>> .
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1243/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>>>>
>>>>
>>>> *FAQ*
>>>>
>>>> *How can I help test this release?*
>>>>
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>
>>>> *But my bug isn't fixed!??!*
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from 2.1.1.
>>>>
>>>
>>>
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Michael Armbrust
rk/tree/v2.2.0-rc4,
>>>>>>
>>>>>> 1. Windows Server 2012 R2 / R 3.3.1 - passed (
>>>>>> https://ci.appveyor.com/project/spark-test/spark/
>>>>>> build/755-r-test-v2.2.0-rc4)
>>>>>> 2. macOS Sierra 10.12.3 / R 3.4.0 - passed
>>>>>> 3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
>>>>>> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>>>>> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (https://gist.github.com/
>>>>>> HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>>>>>>
>>>>>>
>>>>>> Per https://github.com/apache/spark/tree/v2.1.1,
>>>>>>
>>>>>> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (https://gist.github.com/
>>>>>> HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)
>>>>>>
>>>>>>
>>>>>> This looks being failed only in CentOS 7.2.1511 / R 3.4.0 given my
>>>>>> tests and observations.
>>>>>>
>>>>>> This is failed in Spark 2.1.1. So, it sounds not a regression
>>>>>> although it is a bug that should be fixed (whether in Spark or R).
>>>>>>
>>>>>>
>>>>>> 2017-06-14 8:28 GMT+09:00 Xiao Li <gatorsm...@gmail.com>:
>>>>>>
>>>>>>> -1
>>>>>>>
>>>>>>> Spark 2.2 is unable to read the partitioned table created by Spark
>>>>>>> 2.1 or earlier.
>>>>>>>
>>>>>>> Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085
>>>>>>>
>>>>>>> Will fix it soon.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Xiao Li
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2017-06-13 9:39 GMT-07:00 Joseph Bradley <jos...@databricks.com>:
>>>>>>>
>>>>>>>> Re: the QA JIRAs:
>>>>>>>> Thanks for discussing them.  I still feel they are very helpful; I
>>>>>>>> particularly notice not having to spend a solid 2-3 weeks of time QAing
>>>>>>>> (unlike in earlier Spark releases).  One other point not mentioned 
>>>>>>>> above: I
>>>>>>>> think they serve as a very helpful reminder/training for the community 
>>>>>>>> for
>>>>>>>> rigor in development.  Since we instituted QA JIRAs, contributors have 
>>>>>>>> been
>>>>>>>> a lot better about adding in docs early, rather than waiting until the 
>>>>>>>> end
>>>>>>>> of the cycle (though I know this is drawing conclusions from 
>>>>>>>> correlations).
>>>>>>>>
>>>>>>>> I would vote in favor of the RC...but I'll wait to see about the
>>>>>>>> reported failures.
>>>>>>>>
>>>>>>>> On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen <so...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Different errors as in https://issues.apache.org/
>>>>>>>>> jira/browse/SPARK-20520 but that's also reporting R test
>>>>>>>>> failures.
>>>>>>>>>
>>>>>>>>> I went back and tried to run the R tests and they passed, at least
>>>>>>>>> on Ubuntu 17 / R 3.3.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath <
>>>>>>>>> nick.pentre...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> All Scala, Python tests pass. ML QA and doc issues are resolved
>>>>>>>>>> (as well as R it seems).
>>>>>>>>>>
>>>>>>>>>> However, I'm seeing the following test failure on R consistently:
>>>>>>>>>> https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 8 Jun 2017 at 08:48 Denny Lee <denny.g@gmail.com>
>>>>>>>>>&

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-20 Thread Michael Armbrust
I will kick off the voting with a +1.

On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc5
> <https://github.com/apache/spark/tree/v2.2.0-rc5> (62e442e73a2fa66
> 3892d2edaff5f7d72d7f402ed)
>
> List of JIRA tickets resolved can be found with this filter
> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
> .
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1243/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


[VOTE] Apache Spark 2.2.0 (RC5)

2017-06-20 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.2.0-rc5 (62e442e73a2fa663892d2edaff5f7d72d7f402ed)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1243/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


Re: cannot call explain or show on dataframe in structured streaming addBatch dataframe

2017-06-19 Thread Michael Armbrust
There is a little bit of weirdness to how we override the default query
planner to replace it with an incrementalizing planner.  As such, calling
any operation that changes the query plan (such as a LIMIT) would cause it
to revert to the batch planner and return the wrong answer.  We should fix
this before we finalize the Sink API.
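For sink authors hitting the same limitation, a sketch of the collect-then-recreate work-around that the ConsoleSink code quoted below relies on; Sink is an internal, evolving API, so this is illustrative only:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.execution.streaming.Sink

class ShowSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Materialize the incrementally-planned batch as-is...
    val rows: Array[Row] = data.collect()
    // ...then rebuild a plain local DataFrame, so show()/explain() run through
    // the normal batch planner instead of re-planning the streaming plan.
    val local = data.sparkSession.createDataFrame(
      data.sparkSession.sparkContext.parallelize(rows), data.schema)
    local.show(20, truncate = false)
  }
}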

On Mon, Jun 19, 2017 at 9:32 AM, assaf.mendelson 
wrote:

> Hi all,
>
> I am playing around with structured streaming and looked at the code for
> ConsoleSink.
>
>
>
> I see the code has:
>
>
>
> data.sparkSession.createDataFrame(
> data.sparkSession.sparkContext.parallelize(data.collect()), data.schema)
> .show(*numRowsToShow*, *isTruncated*)
> }
>
>
>
> I was wondering why it does not do data directly? Why the collect and
> parallelize?
>
>
>
>
>
> Thanks,
>
>   Assaf.
>
>
>
> --
>


Re: the scheme in stream reader

2017-06-19 Thread Michael Armbrust
The socket source can't know how to parse your data.  I think the right
thing would be for it to throw an exception saying that you can't set the
schema here.  Would you mind opening a JIRA ticket?

If you are trying to parse data from something like JSON then you should
use `from_json` on the value returned.
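A sketch of that approach against a socket source; the host, port, and field names are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("socket-json-sketch").getOrCreate()
import spark.implicits._

// The socket source always produces a single string column named "value";
// the schema is applied by parsing, not by DataStreamReader.schema(...).
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val parsed = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
  .select(from_json($"value", schema).as("data"))
  .select("data.*")   // name and age as top-level columns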

On Sun, Jun 18, 2017 at 12:27 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> I set the schema for DataStreamReader, but when I print the schema it just
> printed:
> root
> |--value:string (nullable=true)
>
> My code is
>
> val line = ss.readStream.format("socket")
> .option("ip",xxx)
> .option("port",xxx)
> .schema(StructType(StructField("name", StringType) :: StructField("age",
> IntegerType) :: Nil)).load
> line.printSchema
>
> My spark version is 2.1.0.
> I want printSchema to print the schema I set in the code. How should I do
> that, please?
> And my original goal is for the data received from the socket to be handled
> with that schema directly. What should I do, please?
>
> thanks
> Fei Shao
>
>
>
>
>
>
>


Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
You might also try with a newer version.  Several instances of code
generation failures have been fixed since 2.0.
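
For reference, a rough sketch of the kind of deeply nested struct() calls discussed
below (column names and nesting depth are invented); with wide enough schemas this
style of query is what was reported to push a single generated method past the JVM's
64 KB limit, after which, per the discussion below, Spark appears to fall back to
non-codegen execution.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, struct}

val spark = SparkSession.builder().appName("nested-struct-sketch").getOrCreate()

val base = spark.range(10).select(col("id"), lit("a").as("tag"))

// Wrap all columns in struct() five times, i.e. df.select(struct(struct(struct(...)))).
val nested = (1 to 5).foldLeft(base) { (df, level) =>
  df.select(struct(df.columns.map(col): _*).as(s"level_$level"))
}

// Inspect the generated plan; large cases can hit the JaninoRuntimeException quoted below.
nested.explain(true)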

On Thu, Jun 15, 2017 at 1:15 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi Michael,
> Spark 2.0.2 - but I have a very interesting test case, actually.
> The optimiser seems to be at fault in a way; I've attached to this email the
> explain output when I limit myself to 2 levels of struct nesting and when it
> goes to 5.
> As you can see, the optimiser seems to be doing a lot more in the latter
> case.
> After further investigation, the code is not "failing" per se - Spark is
> attempting whole-stage codegen, the compilation fails with the 64 KB error,
> and I think it then falls back to the "non codegen" path.
>
> I'll try to create a simpler test case to reproduce this if I can. What do
> you think?
>
> Regards,
>
> Olivier.
>
>
> 2017-06-15 21:08 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>
>> Which version of Spark?  If it's recent I'd open a JIRA.
>>
>> On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> when we create recursive calls to "struct" (up to 5 levels) for
>>> extending a complex datastructure we end up with the following compilation
>>> error :
>>>
>>> org.codehaus.janino.JaninoRuntimeException: Code of method
>>> "(I[Lscala/collection/Iterator;)V" of class
>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator"
>>> grows beyond 64 KB
>>>
>>> The CreateStruct code itself is properly using the ctx.splitExpression
>>> command but the "end result" of the df.select( struct(struct(struct()
>>> ))) ends up being too much.
>>>
>>> Should I open a JIRA or is there a workaround ?
>>>
>>> Regards,
>>>
>>> --
>>> *Olivier Girardot* | Associé
>>> o.girar...@lateral-thoughts.com
>>>
>>
>>
>
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>


Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
Which version of Spark?  If it's recent I'd open a JIRA.

On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> when we create recursive calls to "struct" (up to 5 levels) for extending
> a complex datastructure we end up with the following compilation error :
>
> org.codehaus.janino.JaninoRuntimeException: Code of method
> "(I[Lscala/collection/Iterator;)V" of class "org.apache.spark.sql.
> catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
>
> The CreateStruct code itself is properly using the ctx.splitExpression
> command but the "end result" of the df.select( struct(struct(struct()
> ))) ends up being too much.
>
> Should I open a JIRA or is there a workaround ?
>
> Regards,
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-14 Thread Michael Armbrust
ub.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>>>> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (
>>>>> https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>>>>>
>>>>>
>>>>> Per https://github.com/apache/spark/tree/v2.1.1,
>>>>>
>>>>> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (
>>>>> https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)
>>>>>
>>>>>
>>>>> This looks to be failing only on CentOS 7.2.1511 / R 3.4.0, given my
>>>>> tests and observations.
>>>>>
>>>>> This also fails in Spark 2.1.1, so it sounds like it is not a regression,
>>>>> although it is a bug that should be fixed (whether in Spark or R).
>>>>>
>>>>>
>>>>> 2017-06-14 8:28 GMT+09:00 Xiao Li <gatorsm...@gmail.com>:
>>>>>
>>>>>> -1
>>>>>>
>>>>>> Spark 2.2 is unable to read the partitioned table created by Spark
>>>>>> 2.1 or earlier.
>>>>>>
>>>>>> Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085
>>>>>>
>>>>>> Will fix it soon.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Xiao Li
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2017-06-13 9:39 GMT-07:00 Joseph Bradley <jos...@databricks.com>:
>>>>>>
>>>>>>> Re: the QA JIRAs:
>>>>>>> Thanks for discussing them.  I still feel they are very helpful; I
>>>>>>> particularly notice not having to spend a solid 2-3 weeks of time QAing
>>>>>>> (unlike in earlier Spark releases).  One other point not mentioned 
>>>>>>> above: I
>>>>>>> think they serve as a very helpful reminder/training for the community 
>>>>>>> for
>>>>>>> rigor in development.  Since we instituted QA JIRAs, contributors have 
>>>>>>> been
>>>>>>> a lot better about adding in docs early, rather than waiting until the 
>>>>>>> end
>>>>>>> of the cycle (though I know this is drawing conclusions from 
>>>>>>> correlations).
>>>>>>>
>>>>>>> I would vote in favor of the RC...but I'll wait to see about the
>>>>>>> reported failures.
>>>>>>>
>>>>>>> On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen <so...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Different errors as in https://issues.apache.org/j
>>>>>>>> ira/browse/SPARK-20520 but that's also reporting R test failures.
>>>>>>>>
>>>>>>>> I went back and tried to run the R tests and they passed, at least
>>>>>>>> on Ubuntu 17 / R 3.3.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath <
>>>>>>>> nick.pentre...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> All Scala, Python tests pass. ML QA and doc issues are resolved
>>>>>>>>> (as well as R it seems).
>>>>>>>>>
>>>>>>>>> However, I'm seeing the following test failure on R consistently:
>>>>>>>>> https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 8 Jun 2017 at 08:48 Denny Lee <denny.g@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 non-binding
>>>>>>>>>>
>>>>>>>>>> Tested on macOS Sierra, Ubuntu 16.04
>>>>>>>>>> test suite includes various test cases including Spark SQL, ML,
>>>>>>>>>> GraphFrames, Structured Streaming
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 7, 2017 at 9:40 PM vaquar khan <vaquar.k...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 non-bi

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
Apologies for messing up the https urls.  My mistake.  I'll try to get it
right next time.

Regarding the readiness of this and previous RCs: I did cut RC1 & RC2
knowing that they were unlikely to pass.  That said, I still think these
early RCs are valuable; I know several users who wanted to test new
features in 2.2 and have used them.  Now, if we would prefer to call them
preview or RC0 or something, I'd be okay with that as well.

Regarding doc updates, I don't think it is a requirement that they be voted
on as part of the release, even if they are version specific.  I think we
have regularly updated the website with documentation that was merged after
the release.

I personally don't think the QA umbrella JIRAs are particularly effective,
but I also wouldn't ban their use if others think they are.  However, I do
think that real QA needs an RC to test, so I think it is fine that there is
still outstanding QA to be done when an RC is cut.  For example, I plan to
run a bunch of streaming workloads on RC4 and will vote accordingly.

TL;DR: Based on what I have heard from everyone so far, there are currently
no known issues that should fail the vote here.  We should begin testing
RC4.  Thanks to everyone for your help!

On Mon, Jun 5, 2017 at 1:20 PM, Sean Owen <so...@cloudera.com> wrote:

> (I apologize for going on about this, but I've asked ~4 times: could you
> make the URLs here in the form email HTTPS URLs? It sounds minor, but we're
> asking people to verify the integrity of software and hashes, and this is
> the one case where it is actually important.)
>
> The "2.2" JIRAs don't look like updates to the non-version-specific web
> pages. If they affect release docs (i.e. under spark.apache.org/docs/),
> or the code, those QA/doc updates have to happen before a release. Right? I
> feel like this is self-evident but this comes up every minor release, that
> some testing or doc changes for a release can happen after the code and
> docs for the release are finalized. They obviously can't.
>
> I know, I get it. I think the reality is that the reporters don't believe
> there is something must-do for the 2.2.0 release, or else they'd have
> spoken up. In that case, these should be closed already as they're
> semantically "Blockers" and we shouldn't make an RC that can't pass.
>
> ... or should we? Actually, to me the idea of an "RC0" release as a
> preview, and RCs that are known to fail for testing purposes seem OK. But
> if that's the purpose here, let's say it.
>
> If the "QA" JIRAs just represent that 'we will test things, in general',
> then I think they're superfluous at best. These aren't used consistently,
> and their intent isn't actionable (i.e. it sounds like no particular
> testing resolves the JIRA). They signal something that doesn't seem to
> match the intent.
>
> Can we close the QA JIRAs -- and are there any actual must-have docs not
> already in the 2.2 branch?
>
> On Mon, Jun 5, 2017 at 8:52 PM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> I commented on that JIRA, I don't think that should block the release.
>> We can support both options long term if this vote passes.  Looks like the
>> remaining JIRAs are doc/website updates that can happen after the vote or
>> QA that should be done on this RC.  I think we are ready to start testing
>> this release seriously!
>>
>> On Mon, Jun 5, 2017 at 12:40 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Xiao opened a blocker on 2.2.0 this morning:
>>>
>>> SPARK-20980 Rename the option `wholeFile` to `multiLine` for JSON and CSV
>>>
>>> I don't see that this should block?
>>>
>>> We still have 7 Critical issues:
>>>
>>> SPARK-20520 R streaming tests failed on Windows
>>> SPARK-20512 SparkR 2.2 QA: Programming guide, migration guide, vignettes
>>> updates
>>> SPARK-20499 Spark MLlib, GraphX 2.2 QA umbrella
>>> SPARK-20508 Spark R 2.2 QA umbrella
>>> SPARK-20513 Update SparkR website for 2.2
>>> SPARK-20510 SparkR 2.2 QA: Update user guide for new features & APIs
>>> SPARK-20507 Update MLlib, GraphX websites for 2.2
>>>
>>> I'm going to assume that the R test issue isn't actually that big a
>>> deal, and that the 2.2 items are done. Anything that really is for 2.2
>>> needs to block the release; Joseph what's the status on those?
>>>
>>> On Mon, Jun 5, 2017 at 8:15 PM Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00
>>>

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
I commented on that JIRA, I don't think that should block the release.  We
can support both options long term if this vote passes.  Looks like the
remaining JIRAs are doc/website updates that can happen after the vote or
QA that should be done on this RC.  I think we are ready to start testing
this release seriously!

On Mon, Jun 5, 2017 at 12:40 PM, Sean Owen <so...@cloudera.com> wrote:

> Xiao opened a blocker on 2.2.0 this morning:
>
> SPARK-20980 Rename the option `wholeFile` to `multiLine` for JSON and CSV
>
> I don't see that this should block?
>
> We still have 7 Critical issues:
>
> SPARK-20520 R streaming tests failed on Windows
> SPARK-20512 SparkR 2.2 QA: Programming guide, migration guide, vignettes
> updates
> SPARK-20499 Spark MLlib, GraphX 2.2 QA umbrella
> SPARK-20508 Spark R 2.2 QA umbrella
> SPARK-20513 Update SparkR website for 2.2
> SPARK-20510 SparkR 2.2 QA: Update user guide for new features & APIs
> SPARK-20507 Update MLlib, GraphX websites for 2.2
>
> I'm going to assume that the R test issue isn't actually that big a deal,
> and that the 2.2 items are done. Anything that really is for 2.2 needs to
> block the release; Joseph what's the status on those?
>
> On Mon, Jun 5, 2017 at 8:15 PM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc4
>> <https://github.com/apache/spark/tree/v2.2.0-rc4> (377cfa8ac7ff7a8
>> a6a6d273182e18ea7dc25ce7e)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1241/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>


[VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc4 (377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1241/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


Re: [VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Michael Armbrust
This should probably fail the vote.  I'll follow up with an RC4.

On Fri, Jun 2, 2017 at 4:11 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> I'm -1 on this.
>
> I merged a PR <https://github.com/apache/spark/pull/18172> to master/2.2
> today and broke the build. I'm really sorry for the trouble, and I should
> not have been so aggressive when merging PRs. The actual cause was some
> misleading comments in the code and a bug in Spark's testing framework:
> it never runs REPL tests unless you change code in the REPL module.
>
> I will be more careful in the future, and should NEVER backport
> non-bug-fix commits to an RC branch. Sorry again for the trouble!
>
> On Fri, Jun 2, 2017 at 2:40 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, June 6th, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc3
>> <https://github.com/apache/spark/tree/v2.2.0-rc3> (cc5dbd55b0b312a
>> 661d21a4b605ce5ead2ba5218)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1239/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>
>


[VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Tues, June 6th, 2017 at 12:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc3 (cc5dbd55b0b312a661d21a4b605ce5ead2ba5218)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1239/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-06-02 Thread Michael Armbrust
This vote fails.  I will follow shortly with RC3.

On Thu, Jun 1, 2017 at 8:28 PM, Reynold Xin <r...@databricks.com> wrote:

> Again (I've probably said this more than 10 times already in different
> threads), SPARK-18350 has no impact on whether the timestamp type is with
> timezone or without timezone. It simply allows a session specific timezone
> setting rather than having Spark always rely on the machine timezone.
>
> On Wed, May 31, 2017 at 11:58 AM, Kostas Sakellis <kos...@cloudera.com>
> wrote:
>
>> Hey Michael,
>>
>> There is a discussion on TIMESTAMP semantics going on the thread "SQL
>> TIMESTAMP semantics vs. SPARK-18350" which might impact Spark 2.2. Should
>> we make a decision there before voting on the next RC for Spark 2.2?
>>
>> Thanks,
>> Kostas
>>
>> On Tue, May 30, 2017 at 12:09 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Last call, anything else important in-flight for 2.2?
>>>
>>> On Thu, May 25, 2017 at 10:56 AM, Michael Allman <mich...@videoamp.com>
>>> wrote:
>>>
>>>> PR is here: https://github.com/apache/spark/pull/18112
>>>>
>>>>
>>>> On May 25, 2017, at 10:28 AM, Michael Allman <mich...@videoamp.com>
>>>> wrote:
>>>>
>>>> Michael,
>>>>
>>>> If you haven't started cutting the new RC, I'm working on a
>>>> documentation PR right now I'm hoping we can get into Spark 2.2 as a
>>>> migration note, even if it's just a mention: https://issues.apache
>>>> .org/jira/browse/SPARK-20888.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On May 22, 2017, at 11:39 AM, Michael Armbrust <mich...@databricks.com>
>>>> wrote:
>>>>
>>>> I'm waiting for SPARK-20814
>>>> <https://issues.apache.org/jira/browse/SPARK-20814> at Marcelo's
>>>> request and I'd also like to include SPARK-20844
>>>> <https://issues.apache.org/jira/browse/SPARK-20844>.  I think we
>>>> should be able to cut another RC midweek.
>>>>
>>>> On Fri, May 19, 2017 at 11:53 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> All the outstanding ML QA doc and user guide items are done for 2.2 so
>>>>> from that side we should be good to cut another RC :)
>>>>>
>>>>>
>>>>> On Thu, 18 May 2017 at 00:18 Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Seeing an issue with the DataScanExec and some of our integration
>>>>>> tests for the SCC. Running dataframe read and writes from the shell seems
>>>>>> fine but the Redaction code seems to get a "None" when doing
>>>>>> SparkSession.getActiveSession.get in our integration tests. I'm not
>>>>>> sure why but i'll dig into this later if I get a chance.
>>>>>>
>>>>>> Example Failed Test
>>>>>> https://github.com/datastax/spark-cassandra-connector/blob/v
>>>>>> 2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/sp
>>>>>> ark/connector/sql/CassandraSQLSpec.scala#L311
>>>>>>
>>>>>> ```[info]   org.apache.spark.SparkException: Job aborted due to
>>>>>> stage failure: Task serialization failed: 
>>>>>> java.util.NoSuchElementException:
>>>>>> None.get
>>>>>> [info] java.util.NoSuchElementException: None.get
>>>>>> [info] at scala.None$.get(Option.scala:347)
>>>>>> [info] at scala.None$.get(Option.scala:345)
>>>>>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org
>>>>>> $apache$spark$sql$execution$DataSourceScanExec$$redact(DataSo
>>>>>> urceScanExec.scala:70)
>>>>>> [info] at org.apache.spark.sql.execution
>>>>>> .DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
>>>>>> [info] at org.apache.spark.sql.execution
>>>>>> .DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
>>>>>> ```
>>>>>>
>>>>>> Again, this only seems to repro in our IT suite, so I'm not sure if this
>>>>>> is a real issue.
>>>>>>
>>>>>>
>>>>>> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley <jos...@databricks.c

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-30 Thread Michael Armbrust
Last call, anything else important in-flight for 2.2?

On Thu, May 25, 2017 at 10:56 AM, Michael Allman <mich...@videoamp.com>
wrote:

> PR is here: https://github.com/apache/spark/pull/18112
>
>
> On May 25, 2017, at 10:28 AM, Michael Allman <mich...@videoamp.com> wrote:
>
> Michael,
>
> If you haven't started cutting the new RC, I'm working on a documentation
> PR right now I'm hoping we can get into Spark 2.2 as a migration note, even
> if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888.
>
> Michael
>
>
> On May 22, 2017, at 11:39 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> I'm waiting for SPARK-20814
> <https://issues.apache.org/jira/browse/SPARK-20814> at Marcelo's
> request and I'd also like to include SPARK-20844
> <https://issues.apache.org/jira/browse/SPARK-20844>.  I think we should
> be able to cut another RC midweek.
>
> On Fri, May 19, 2017 at 11:53 AM, Nick Pentreath <nick.pentre...@gmail.com
> > wrote:
>
>> All the outstanding ML QA doc and user guide items are done for 2.2 so
>> from that side we should be good to cut another RC :)
>>
>>
>> On Thu, 18 May 2017 at 00:18 Russell Spitzer <russell.spit...@gmail.com>
>> wrote:
>>
>>> Seeing an issue with the DataScanExec and some of our integration tests
>>> for the SCC. Running dataframe read and writes from the shell seems fine
>>> but the Redaction code seems to get a "None" when doing
>>> SparkSession.getActiveSession.get in our integration tests. I'm not
>>> sure why but i'll dig into this later if I get a chance.
>>>
>>> Example Failed Test
>>> https://github.com/datastax/spark-cassandra-connector/blob/
>>> v2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/
>>> spark/connector/sql/CassandraSQLSpec.scala#L311
>>>
>>> ```[info]   org.apache.spark.SparkException: Job aborted due to stage
>>> failure: Task serialization failed: java.util.NoSuchElementException:
>>> None.get
>>> [info] java.util.NoSuchElementException: None.get
>>> [info] at scala.None$.get(Option.scala:347)
>>> [info] at scala.None$.get(Option.scala:345)
>>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org$
>>> apache$spark$sql$execution$DataSourceScanExec$$redact(
>>> DataSourceScanExec.scala:70)
>>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$
>>> 4.apply(DataSourceScanExec.scala:54)
>>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$
>>> 4.apply(DataSourceScanExec.scala:52)
>>> ```
>>>
>>> Again, this only seems to repro in our IT suite, so I'm not sure if this is
>>> a real issue.
>>>
>>>
>>> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley <jos...@databricks.com>
>>> wrote:
>>>
>>>> All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.  Thanks
>>>> everyone who helped out on those!
>>>>
>>>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are
>>>> essentially all for documentation.
>>>>
>>>> Joseph
>>>>
>>>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin <van...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
>>>>> fixed, since the change that caused it is in branch-2.2. Probably a
>>>>> good idea to raise it to blocker at this point.
>>>>>
>>>>> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>> > I'm going to -1 given the outstanding issues and lack of +1s.  I'll
>>>>> create
>>>>> > another RC once ML has had time to take care of the more critical
>>>>> problems.
>>>>> > In the meantime please keep testing this release!
>>>>> >
>>>>> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki <
>>>>> ishiz...@jp.ibm.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> +1 (non-binding)
>>>>> >>
>>>>> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the
>>>>> tests for
>>>>> >> core have passed.
>>>>> >>
>>>>> >> $ java -version
>>>>> >> openjdk version "1.8.0_111"
>>>>> >> OpenJDK Ru

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread Michael Armbrust
-dev

Have you tried clearing out the checkpoint directory?  Can you also give
the full stack trace?
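
For context, a minimal, self-contained sketch (the source, host, port, and paths are
made up) of where the checkpoint directory in question normally lives: it is the
checkpointLocation option on the write side of the streaming query, and "clearing it
out" means deleting that directory between incompatible runs.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val counts = lines.groupBy($"value").count()

// checkpointLocation is an option of the writer, not the reader; delete this
// directory to reset the query's state and offsets.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/value-counts")
  .start()

query.awaitTermination()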

On Wed, May 24, 2017 at 3:45 PM, kant kodali  wrote:

> Even if I do simple count aggregation like below I get the same error as
> https://issues.apache.org/jira/browse/SPARK-19268
>
> Dataset<Row> df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 hours", "24 hours"), df1.col("AppName")).count();
>
>
> On Wed, May 24, 2017 at 3:35 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I am using Spark 2.1.1 and running in a Standalone mode using HDFS and
>> Kafka
>>
>> I am running into the same problem as https://issues.apache.org/jira
>> /browse/SPARK-19268 with my app(not KafkaWordCount).
>>
>> Here is my sample code
>>
>> *Here is how I create ReadStream*
>>
>> Dataset<Row> ds = sparkSession.readStream()
>> .format("kafka")
>> .option("kafka.bootstrap.servers", config.getString("kafka.consumer.settings.bootstrapServers"))
>> .option("subscribe", config.getString("kafka.consumer.settings.topicName"))
>> .option("startingOffsets", "earliest")
>> .option("failOnDataLoss", "false")
>> .option("checkpointLocation", hdfsCheckPointDir)
>> .load();
>>
>>
>> *The core logic*
>>
>> Dataset<Row> df = ds.select(from_json(new Column("value").cast("string"), client.getSchema()).as("payload"));
>> Dataset<Row> df1 = df.selectExpr("payload.info.*", "payload.data.*");
>> Dataset<Row> df2 = df1.groupBy(window(df1.col("Timestamp5"), "24 hours", "24 hours"), df1.col("AppName")).agg(sum("Amount"));
>> StreamingQuery query = df1.writeStream().foreach(new KafkaSink()).outputMode("update").start();
>> query.awaitTermination();
>>
>>
>> I can also provide any other information you may need.
>>
>> Thanks!
>>
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-22 Thread Michael Armbrust
I'm waiting for SPARK-20814
<https://issues.apache.org/jira/browse/SPARK-20814> at Marcelo's
request and I'd also like to include SPARK-20844
<https://issues.apache.org/jira/browse/SPARK-20844>.  I think we should be
able to cut another RC midweek.

On Fri, May 19, 2017 at 11:53 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> All the outstanding ML QA doc and user guide items are done for 2.2 so
> from that side we should be good to cut another RC :)
>
>
> On Thu, 18 May 2017 at 00:18 Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> Seeing an issue with the DataScanExec and some of our integration tests
>> for the SCC. Running dataframe read and writes from the shell seems fine
>> but the Redaction code seems to get a "None" when doing
>> SparkSession.getActiveSession.get in our integration tests. I'm not sure
>> why but i'll dig into this later if I get a chance.
>>
>> Example Failed Test
>> https://github.com/datastax/spark-cassandra-connector/
>> blob/v2.0.1/spark-cassandra-connector/src/it/scala/com/
>> datastax/spark/connector/sql/CassandraSQLSpec.scala#L311
>>
>> ```[info]   org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task serialization failed: java.util.NoSuchElementException:
>> None.get
>> [info] java.util.NoSuchElementException: None.get
>> [info] at scala.None$.get(Option.scala:347)
>> [info] at scala.None$.get(Option.scala:345)
>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org
>> $apache$spark$sql$execution$DataSourceScanExec$$
>> redact(DataSourceScanExec.scala:70)
>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$
>> anonfun$4.apply(DataSourceScanExec.scala:54)
>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$
>> anonfun$4.apply(DataSourceScanExec.scala:52)
>> ```
>>
>> Again, this only seems to repro in our IT suite, so I'm not sure if this is
>> a real issue.
>>
>>
>> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.  Thanks
>>> everyone who helped out on those!
>>>
>>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are
>>> essentially all for documentation.
>>>
>>> Joseph
>>>
>>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin <van...@cloudera.com>
>>> wrote:
>>>
>>>> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
>>>> fixed, since the change that caused it is in branch-2.2. Probably a
>>>> good idea to raise it to blocker at this point.
>>>>
>>>> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
>>>> <mich...@databricks.com> wrote:
>>>> > I'm going to -1 given the outstanding issues and lack of +1s.  I'll
>>>> create
>>>> > another RC once ML has had time to take care of the more critical
>>>> problems.
>>>> > In the meantime please keep testing this release!
>>>> >
>>>> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com
>>>> >
>>>> > wrote:
>>>> >>
>>>> >> +1 (non-binding)
>>>> >>
>>>> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the
>>>> tests for
>>>> >> core have passed.
>>>> >>
>>>> >> $ java -version
>>>> >> openjdk version "1.8.0_111"
>>>> >> OpenJDK Runtime Environment (build
>>>> >> 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
>>>> >> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>>>> >> $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn
>>>> -Phadoop-2.7
>>>> >> package install
>>>> >> $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl
>>>> core
>>>> >> ...
>>>> >> Run completed in 15 minutes, 12 seconds.
>>>> >> Total number of tests run: 1940
>>>> >> Suites: completed 206, aborted 0
>>>> >> Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0
>>>> >> All tests passed.
>>>> >> [INFO]
>>>> >> 
>>>> 
>>>> >> [INFO] BUILD SUCCESS
>>>> >> [INFO]
>>

[VOTE] Apache Spark 2.2.0 (RC2)

2017-05-04 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc2 (1d4017b44d5e6ad156abeaae6371747f111dd1f9)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1236/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-03 Thread Michael Armbrust
 at runtime, which 
>>>>>> causes
>>>>>> the no such method exception to occur.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Frank Austin Nothaft
>>>>>> fnoth...@berkeley.edu
>>>>>> fnoth...@eecs.berkeley.edu
>>>>>> 202-340-0466 <(202)%20340-0466>
>>>>>>
>>>>>> On May 1, 2017, at 11:31 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>> Frank,
>>>>>>
>>>>>> The issue you're running into is caused by using parquet-avro with
>>>>>> Avro 1.7. Can't your downstream project set the Avro dependency to 1.8?
>>>>>> Spark can't update Avro because it is a breaking change that would force
>>>>>> users to rebuilt specific Avro classes in some cases. But you should be
>>>>>> free to use Avro 1.8 to avoid the problem.
>>>>>>
>>>>>> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
>>>>>> fnoth...@berkeley.edu> wrote:
>>>>>>
>>>>>>> Hi Ryan et al,
>>>>>>>
>>>>>>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>>>>>>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>>>>>>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as 
>>>>>>> a
>>>>>>> dependency. My colleague Michael (who posted earlier on this thread)
>>>>>>> documented this in Spark-19697
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that
>>>>>>> Spark has unit tests that check this compatibility issue, but it looks 
>>>>>>> like
>>>>>>> there was a recent change that sets a test scope dependency on Avro
>>>>>>> 1.8.0
>>>>>>> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
>>>>>>> which masks this issue in the unit tests. With this error, you can’t use
>>>>>>> the ParquetAvroOutputFormat from a application running on Spark 2.2.0.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Frank Austin Nothaft
>>>>>>> fnoth...@berkeley.edu
>>>>>>> fnoth...@eecs.berkeley.edu
>>>>>>> 202-340-0466 <(202)%20340-0466>
>>>>>>>
>>>>>>> On May 1, 2017, at 10:02 AM, Ryan Blue <rb...@netflix.com.INVALID
>>>>>>> <rb...@netflix.com.invalid>> wrote:
>>>>>>>
>>>>>>> I agree with Sean. Spark only pulls in parquet-avro for tests. For
>>>>>>> execution, it implements the record materialization APIs in Parquet to 
>>>>>>> go
>>>>>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
>>>>>>> dependency into Spark as far as I can tell.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <so...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> See discussion at https://github.com/apache/spark/pull/17163 -- I
>>>>>>>> think the issue is that fixing this trades one problem for a slightly
>>>>>>>> bigger one.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <heue...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2
>>>>>>>>> but does not bump the dependency version for avro (currently at 
>>>>>>>>> 1.7.7).
>>>>>>>>> Though perhaps not clear from the issue I reported [0], this means 
>>>>>>>>> that
>>>>>>>>> Spark is internally inconsistent, in that a call through parquet 
>>>>>>>>> (which
>>>>>>>>> depends on avro 1.8.0 [1]) may throw errors at runtime when it hits 
>>>>>>>>> avro
>>>>>>>>> 1.7.7 on the classpath.  Avro 1.8.0 is not binary compatible with 
>>

Re: [ANNOUNCE] Apache Spark 2.1.1

2017-05-03 Thread Michael Armbrust
Thanks for flagging this.  There was a bug in the git replication
<https://issues.apache.org/jira/browse/SPARK-20570> that has been fixed.

On Wed, May 3, 2017 at 1:29 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:

> Looking good...
> one small thing - the documentation on the web site is still 2.1.0
> Specifically, the home page has a link (under Documentation menu) labeled
> Latest Release (Spark 2.1.1), but when I click it, I get the 2.1.0
> documentation.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Wed, May 3, 2017 at 1:18 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> We are happy to announce the availability of Spark 2.1.1!
>>
>> Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1
>> maintenance branch of Spark. We strongly recommend all 2.1.x users to
>> upgrade to this stable release.
>>
>> To download Apache Spark 2.1.1 visit http://spark.apache.org/
>> downloads.html
>>
>> We would like to acknowledge all community members for contributing
>> patches to this release.
>>
>
>


[ANNOUNCE] Apache Spark 2.1.1

2017-05-02 Thread Michael Armbrust
We are happy to announce the availability of Spark 2.1.1!

Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1
maintenance branch of Spark. We strongly recommend all 2.1.x users to
upgrade to this stable release.

To download Apache Spark 2.1.1 visit http://spark.apache.org/downloads.html

We would like to acknowledge all community members for contributing patches
to this release.


Re: Spark 2.2.0 or Spark 2.3.0?

2017-05-02 Thread Michael Armbrust
An RC for 2.2.0 was released last week.  Please test.

Note that update mode has been supported since 2.0.

On Mon, May 1, 2017 at 10:43 PM, kant kodali  wrote:

> Hi All,
>
> If I understand the Spark standard release process correctly. It looks
> like the official release is going to be sometime end of this month and it
> is going to be 2.2.0 right (not 2.3.0)? I am eagerly looking for Spark
> 2.2.0 because of the "update mode" option in Spark Streaming. Please
> correct me if I am wrong.
>
> Thanks!
>


Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Michael Armbrust
He's just suggesting that since the DataStreamWriter start() method can
fill in an option named "path", we should make that a synonym for "topic".
Then you could do something like:

df.writeStream.format("kafka").start("topic")

Seems reasonable if people don't think that is confusing.
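
To make the comparison concrete, here is a small sketch (the source, topic name,
broker address, and checkpoint paths are illustrative) of the two ways the Kafka sink
already picks a topic -- a "topic" writer option or a per-row "topic" column --
alongside the shorthand proposed above, which was not implemented at the time of this
thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("kafka-topic-sketch").getOrCreate()

// Any streaming DataFrame with a string "value" column will do for this sketch.
val df = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// 1) Topic given as a writer option.
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "events")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-a")
  .start()

// 2) Topic given as a per-row "topic" column.
df.withColumn("topic", lit("events"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-b")
  .start()

// 3) The shorthand discussed above (hypothetical at the time of this thread):
//    df.writeStream.format("kafka").option(...).start("events")  // "path" would fall back to "topic"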

On Mon, May 1, 2017 at 8:43 AM, Cody Koeninger  wrote:

> I'm confused about what you're suggesting.  Are you saying that a
> Kafka sink should take a filesystem path as an option?
>
> On Mon, May 1, 2017 at 8:52 AM, Jacek Laskowski  wrote:
> > Hi,
> >
> > I've just found out that KafkaSourceProvider supports topic option
> > that sets the Kafka topic to save a DataFrame to.
> >
> > You can also use topic column to assign rows to topics.
> >
> > Given these features, I've been wondering why the "path" option is not
> > supported (even with the lowest precedence), so that when no topic column or
> > option is defined, save(path: String) would be used as the last resort.
> >
> > WDYT?
> >
> > It looks pretty trivial to support --> see KafkaSourceProvider at
> > lines [1] and [2] if I'm not mistaken.
> >
> > [1] https://github.com/apache/spark/blob/master/external/
> kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/
> KafkaSourceProvider.scala#L145
> > [2] https://github.com/apache/spark/blob/master/external/
> kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/
> KafkaSourceProvider.scala#L163
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Michael Armbrust
All of those look like QA or documentation, which I don't think needs to
block testing on an RC (and in fact probably needs an RC to test?).
Joseph, please correct me if I'm wrong.  It is unlikely this first RC is
going to pass, but I wanted to get the ball rolling on testing 2.2.

On Thu, Apr 27, 2017 at 1:45 PM, Sean Owen <so...@cloudera.com> wrote:

> These are still blockers for 2.2:
>
> SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
> SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
> SPARK-20503 ML 2.2 QA: API: Python API coverage
> SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
> sealed audit
> SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
> SPARK-18813 MLlib 2.2 Roadmap
>
> Joseph you opened most of these just now. Is this an "RC0" we know won't
> pass? or, wouldn't we normally cut an RC after those things are ready?
>
> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc1
>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c
>> 1a8f8966c7e64010cf5632cb6)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>


Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-27 Thread Michael Armbrust
I'll also +1

On Thu, Apr 27, 2017 at 4:20 AM, Sean Owen <so...@cloudera.com> wrote:

> +1 , same result as with the last RC. All checks out for me.
>
> On Thu, Apr 27, 2017 at 1:29 AM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.1. The vote is open until Sat, April 29th, 2017 at 18:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.1-rc4
>> <https://github.com/apache/spark/tree/v2.1.1-rc4> (267aca5bd504230
>> 3a718d10635bc0d1a1596853f)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1232/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.0.
>>
>> *What happened to RC1?*
>>
>> There were issues with the release packaging and as a result it was skipped.
>>
>


[VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1235/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


[VOTE] Apache Spark 2.1.1 (RC4)

2017-04-26 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.1.1. The vote is open until Sat, April 29th, 2017 at 18:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.1-rc4 (267aca5bd5042303a718d10635bc0d1a1596853f)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1232/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.1.1?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.1.2 or 2.2.0.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.0.

*What happened to RC1?*

There were issues with the release packaging and as a result it was skipped.


Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Michael Armbrust
Yeah, I agree.

-1 (binding)

This vote fails, and I'll cut a new RC after #17749
<https://github.com/apache/spark/pull/17749> is merged.

On Mon, Apr 24, 2017 at 12:18 PM, Eric Liang <e...@databricks.com> wrote:

> -1 (non-binding)
>
> I also agree with using NEVER_INFER for 2.1.1. The migration cost is
> unexpected for a point release.
>
> On Mon, Apr 24, 2017 at 11:08 AM Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> Whoops, sorry finger slipped on that last message.
>> It sounds like whatever we do is going to break some existing users
>> (either with the tables by case sensitivity or with the unexpected scan).
>>
>> Personally I agree with Michael Allman on this, I believe we should
>> use INFER_NEVER for 2.1.1.
>>
>> On Mon, Apr 24, 2017 at 11:01 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> It
>>>
>>> On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman <mich...@videoamp.com>
>>> wrote:
>>>
>>>> The trouble we ran into is that this upgrade was blocking access to our
>>>> tables, and we didn't know why. This sounds like a kind of migration
>>>> operation, but it was not apparent that this was the case. It took an
>>>> expert examining a stack trace and source code to figure this out. Would a
>>>> more naive end user be able to debug this issue? Maybe we're an unusual
>>>> case, but our particular experience was pretty bad. I have my doubts that
>>>> the schema inference on our largest tables would ever complete without
>>>> throwing some kind of timeout (which we were in fact receiving) or the end
>>>> user just giving up and killing our job. We ended up doing a rollback while
>>>> we investigated the source of the issue. In our case, INFER_NEVER is
>>>> clearly the best configuration. We're going to add that to our default
>>>> configuration files.
>>>>
>>>> My expectation is that a minor point release is a pretty safe bug fix
>>>> release. We were a bit hasty in not doing better due diligence pre-upgrade.
>>>>
>>>> One suggestion the Spark team might consider is releasing 2.1.1 with
>>>> INFER_NEVER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of
>>>> up-front migration notes would help in identifying this new behavior in 
>>>> 2.2.
>>>>
>>>> Thanks,
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Apr 24, 2017, at 2:09 AM, Wenchen Fan <wenc...@databricks.com>
>>>> wrote:
>>>>
>>>> see https://issues.apache.org/jira/browse/SPARK-19611
>>>>
>>>> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> What's the regression this fixed in 2.1 from 2.0?
>>>>>
>>>>> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan <wenc...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
>>>>>> scan all table files only once, and write back the inferred schema to the
>>>>>> metastore so that we don't need to do the schema inference again.
>>>>>>
>>>>>> So technically this will introduce a performance regression for the
>>>>>> first query, but compared to branch-2.0, it's not a performance regression.
>>>>>> And this patch fixed a regression in branch-2.1, which can run in
>>>>>> branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the
>>>>>> default mode.
>>>>>>
>>>>>> + [Eric], what do you think?
>>>>>>
>>>>>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <
>>>>>> mich...@databricks.com> wrote:
>>>>>>
>>>>>>> Thanks for pointing this out, Michael.  Based on the conversation
>>>>>>> on the PR
>>>>>>> <https://github.com/apache/spark/pull/16944#issuecomment-285529275>
>>>>>>> this seems like a risky change to include in a release branch with a
>>>>>>> default other than NEVER_INFER.
>>>>>>>
>>>>>>> +Wenchen?  What do you think?
>>>>>>>
>>>>>>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman <
>>>>>>> mich...@videoamp.com> wrote:
>>>

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Michael Armbrust
Thanks for pointing this out, Michael.  Based on the conversation on the PR
<https://github.com/apache/spark/pull/16944#issuecomment-285529275> this
seems like a risky change to include in a release branch with a default
other than NEVER_INFER.

+Wenchen?  What do you think?
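
For anyone who wants to opt out explicitly, here is a minimal sketch of pinning the
inference mode discussed in this thread to NEVER_INFER; the key and value names come
from the discussion (SPARK-19611), so double-check them against the Spark version you
actually deploy.

import org.apache.spark.sql.SparkSession

// Set the mode when building the session...
val spark = SparkSession.builder()
  .appName("never-infer-sketch")
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()

// ...or equivalently in conf/spark-defaults.conf:
//   spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER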

On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com>
wrote:

> We've identified the cause of the change in behavior. It is related to the
> SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This key and
> its related functionality was absent from our previous build. The default
> setting in the current build was causing Spark to attempt to scan all table
> files during query analysis. Changing this setting to NEVER_INFER disabled
> this operation and resolved the issue we had.
>
> Michael
>
>
> On Apr 20, 2017, at 3:42 PM, Michael Allman <mich...@videoamp.com> wrote:
>
> I want to caution that in testing a build from this morning's branch-2.1
> we found that Hive partition pruning was not working. We found that Spark
> SQL was fetching all Hive table partitions for a very simple query whereas
> in a build from several weeks ago it was fetching only the required
> partitions. I cannot currently think of a reason for the regression outside
> of some difference between branch-2.1 from our previous build and
> branch-2.1 from this morning.
>
> That's all I know right now. We are actively investigating to find the
> root cause of this problem, and specifically whether this is a problem in
> the Spark codebase or not. I will report back when I have an answer to that
> question.
>
> Michael
>
>
> On Apr 18, 2017, at 11:59 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc3
> <https://github.com/apache/spark/tree/v2.1.1-rc3> (2ed19cff2f6ab79
> a718526e5d16633412d8c4dd4)
>
> List of JIRA tickets resolved can be found with this filter
> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1230/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.1?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.0.
>
> *What happened to RC1?*
>
> There were issues with the release packaging and as a result RC1 was skipped.
>
>
>
>


Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-18 Thread Michael Armbrust
In case it wasn't obvious by the appearance of RC3, this vote failed.

On Thu, Mar 30, 2017 at 4:09 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc2
> <https://github.com/apache/spark/tree/v2.1.1-rc2> (
> 02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>
> List of JIRA tickets resolved can be found with this filter
> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1227/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.1?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.0.
>
> *What happened to RC1?*
>
> There were issues with the release packaging and as a result RC1 was skipped.
>


[VOTE] Apache Spark 2.1.1 (RC3)

2017-04-18 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.1-rc3
<https://github.com/apache/spark/tree/v2.1.1-rc3> (
2ed19cff2f6ab79a718526e5d16633412d8c4dd4)

List of JIRA tickets resolved can be found with this filter
<https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1230/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.1.1?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.1.2 or 2.2.0.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.0.

*What happened to RC1?*

There were issues with the release packaging and as a result RC1 was skipped.


branch-2.2 has been cut

2017-04-18 Thread Michael Armbrust
I just cut the release branch for Spark 2.2.  If you are merging important
bug fixes, please backport as appropriate.  If you have doubts if something
should be backported, please ping me.  I'll follow with an RC later this
week.


Re: 2.2 branch

2017-04-17 Thread Michael Armbrust
I'm going to cut branch-2.2 tomorrow morning.

On Thu, Apr 13, 2017 at 11:02 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> Yeah, I was delaying until 2.1.1 was out and some of the Hive questions
> were resolved.  I'll make progress on that by the end of the week.  Let's
> aim for the 2.2 branch cut next week.
>
> On Thu, Apr 13, 2017 at 8:56 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i see there is no 2.2 branch yet for spark. has this been pushed out
>> until after 2.1.1 is done?
>>
>>
>> thanks!
>>
>
>


Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Michael Armbrust
Have time to figure out why the doc build failed?
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-docs/60/console

On Thu, Apr 13, 2017 at 9:39 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> If it would help I'd be more than happy to look at kicking off the
> packaging for RC3 since I've been poking around in Jenkins a bit (for 
> SPARK-20216
> & friends) (I'd still probably need some guidance from a previous release
> coordinator so I understand if that's not actually faster).
>
> On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>
>> I backported the fix into both branch-2.1 and branch-2.0. Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0x5CED8B896A6BDFA0
>>
>>
>> On Mon, Apr 10, 2017 at 4:20 PM, Ryan Blue <rb...@netflix.com> wrote:
>> > DB,
>> >
>> > This vote already failed and there isn't a RC3 vote yet. If you
>> backport the
>> > changes to branch-2.1 they will make it into the next RC.
>> >
>> > rb
>> >
>> > On Mon, Apr 10, 2017 at 3:55 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>> >>
>> >> -1
>> >>
>> >> I think that back-porting SPARK-20270 and SPARK-18555 is very important,
>> >> since it's a critical bug where na.fill will corrupt Long data even when
>> >> the data isn't null.
>> >>
>> >> Thanks.
>> >>
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> --
>> >> Web: https://www.dbtsai.com
>> >> PGP Key ID: 0x5CED8B896A6BDFA0
>> >>
>> >> On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau <hol...@pigscanfly.ca>
>> >> wrote:
>> >>>
>> >>> Following up, the issues with missing pypandoc/pandoc on the packaging
>> >>> machine has been resolved.
>> >>>
>> >>> On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau <hol...@pigscanfly.ca>
>> >>> wrote:
>> >>>>
>> >>>> See SPARK-20216, if Michael can let me know which machine is being
>> used
>> >>>> for packaging I can see if I can install pandoc on it (should be
>> simple but
>> >>>> I know the Jenkins cluster is a bit on the older side).
>> >>>>
>> >>>> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau <hol...@pigscanfly.ca>
>> >>>> wrote:
>> >>>>>
>> >>>>> So the fix is installing pandoc on whichever machine is used for
>> >>>>> packaging. I thought that was generally done on the machine of the
>> person
>> >>>>> rolling the release so I wasn't sure it made sense as a JIRA, but
>> from
>> >>>>> chatting with Josh it sounds like that part might be one of the
>> Jenkins
>> >>>>> workers - is there a fixed one that is used?
>> >>>>>
>> >>>>> Regardless I'll file a JIRA for this when I get back in front of my
>> >>>>> desktop (~1 hour or so).
>> >>>>>
>> >>>>> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust
>> >>>>> <mich...@databricks.com> wrote:
>> >>>>>>
>> >>>>>> Thanks for the comments everyone.  This vote fails.  Here's how I
>> >>>>>> think we should proceed:
>> >>>>>>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>> >>>>>>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
>> >>>>>> report if this is a regression and if there is an easy fix that we
>> should
>> >>>>>> wait for.
>> >>>>>>
>> >>>>>> For all the other test failures, please take the time to look
>> through
>> >>>>>> JIRA and open an issue if one does not already exist so that we
>> can triage
>> >>>>>> if these are just environmental issues.  If I don't hear any
>> objections I'm
>> >>>>>> going to go ahead with RC3 tomorrow.
>> >>>>>>
>> >>>>>> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung
>> >>>>>> <felixcheun...@hotmail.com> wrote:

Re: 2.2 branch

2017-04-13 Thread Michael Armbrust
Yeah, I was delaying until 2.1.1 was out and some of the Hive questions
were resolved.  I'll make progress on that by the end of the week.  Let's
aim for the 2.2 branch cut next week.

On Thu, Apr 13, 2017 at 8:56 AM, Koert Kuipers  wrote:

> i see there is no 2.2 branch yet for spark. has this been pushed out until
> after 2.1.1 is done?
>
>
> thanks!
>


Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-04 Thread Michael Armbrust
Thanks for the comments everyone.  This vote fails.  Here's how I think we
should proceed:
 - [SPARK-20197] - SparkR CRAN - appears to be resolved
 - [SPARK-] - Python packaging - Holden, please file a JIRA and report
if this is a regression and if there is an easy fix that we should wait for.

For all the other test failures, please take the time to look through JIRA
and open an issue if one does not already exist so that we can triage if
these are just environmental issues.  If I don't hear any objections I'm
going to go ahead with RC3 tomorrow.

On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> -1
> sorry, found an issue with SparkR CRAN check.
> Opened SPARK-20197 and working on fix.
>
> --
> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
> Holden Karau <hol...@pigscanfly.ca>
> *Sent:* Friday, March 31, 2017 6:25:20 PM
> *To:* Xiao Li
> *Cc:* Michael Armbrust; dev@spark.apache.org
> *Subject:* Re: [VOTE] Apache Spark 2.1.1 (RC2)
>
> -1 (non-binding)
>
> Python packaging doesn't seem to have quite worked out (looking at
> PKG-INFO, the description is "Description: ! missing pandoc do not
> upload to PyPI "); ideally it would be nice to have this be a version
> we upload to PyPI.
> Building this on my own machine results in a longer description.
>
> My guess is that whichever machine was used to package this is missing the
> pandoc executable (or possibly the pypandoc library).
>
> On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> +1
>>
>> Xiao
>>
>> 2017-03-30 16:09 GMT-07:00 Michael Armbrust <mich...@databricks.com>:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30 PST
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.1-rc2
>>> <https://github.com/apache/spark/tree/v2.1.1-rc2> (
>>> 02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1227/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.0.
>>>
>>> *What happened to RC1?*
>>>
>>> There were issues with the release packaging and as a result RC1 was skipped.
>>>
>>
>>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>


[VOTE] Apache Spark 2.1.1 (RC2)

2017-03-30 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.1-rc2
<https://github.com/apache/spark/tree/v2.1.1-rc2> (
02b165dcc2ee5245d1293a375a31660c9d4e1fa6)

List of JIRA tickets resolved can be found with this filter
<https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1227/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.1.1?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.1.2 or 2.2.0.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.0.

*What happened to RC1?*

There were issues with the release packaging and as a result RC1 was skipped.


Re: Outstanding Spark 2.1.1 issues

2017-03-28 Thread Michael Armbrust
We just fixed the build yesterday.  I'll kick off a new RC today.

On Tue, Mar 28, 2017 at 8:04 AM, Asher Krim <ak...@hubspot.com> wrote:

> Hey Michael,
> any update on this? We're itching for a 2.1.1 release (specifically
> SPARK-14804 which is currently blocking us)
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>
> On Wed, Mar 22, 2017 at 7:44 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> An update: I cut the tag for RC1 last night.  Currently fighting with the
>> release process.  Will post RC1 once I get it working.
>>
>> On Tue, Mar 21, 2017 at 2:16 PM, Nick Pentreath <nick.pentre...@gmail.com
>> > wrote:
>>
>>> As for SPARK-19759 <https://issues.apache.org/jira/browse/SPARK-19759>,
>>> I don't think that needs to be targeted for 2.1.1 so we don't need to worry
>>> about it
>>>
>>>
>>> On Tue, 21 Mar 2017 at 13:49 Holden Karau <hol...@pigscanfly.ca> wrote:
>>>
>>>> I agree with Michael, I think we've got some outstanding issues but
>>>> none of them seem like regression from 2.1 so we should be good to start
>>>> the RC process.
>>>>
>>>> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>> Please speak up if I'm wrong, but none of these seem like critical
>>>> regressions from 2.1.  As such I'll start the RC process later today.
>>>>
>>>> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>> I'm not super sure it should be a blocker for 2.1.1 -- is it a
>>>> regression? Maybe we can get TDs input on it?
>>>>
>>>> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>>>>
>>>> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
>>>> blocker
>>>>
>>>> Best,
>>>>
>>>> Nan
>>>>
>>>> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung <
>>>> felixcheun...@hotmail.com> wrote:
>>>>
>>>> I've been scrubbing R and think we are tracking 2 issues
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-19237
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-19925
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
>>>> Holden Karau <hol...@pigscanfly.ca>
>>>> *Sent:* Monday, March 20, 2017 3:12:35 PM
>>>> *To:* dev@spark.apache.org
>>>> *Subject:* Outstanding Spark 2.1.1 issues
>>>>
>>>> Hi Spark Developers!
>>>>
>>>> As we start working on the Spark 2.1.1 release I've been looking at our
>>>> outstanding issues still targeted for it. I've tried to break it down by
>>>> component so that people in charge of each component can take a quick look
>>>> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
>>>> the overall list is pretty short (only 9 items - 5 if we only look at
>>>> explicitly tagged) :)
>>>>
>>>> If you're working on something for Spark 2.1.1 and it doesn't show up in
>>>> this list please speak up now :) We have a lot of issues (including "in
>>>> progress") that are listed as impacting 2.1.0, but they aren't targeted for
>>>> 2.1.1 - if there is something you are working on there which should be
>>>> targeted for 2.1.1 please let us know so it doesn't slip through the
>>>> cracks.
>>>>
>>>> The query string I used for looking at the 2.1.1 open issues is:
>>>>
>>>> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion =
>>>> 2.1.1 OR cf[12310320] = "2.1.1") AND project = spark AND resolution =
>>>> Unresolved ORDER BY priority DESC
>>>>
>>>> None of the open issues appear to be a regression from 2.1.0, but those
>>>> seem more likely to show up during the RC process (thanks in advance to
>>>> everyone testing their workloads :)) & generally none of them seem to be
>>>>
>>>> (Note: the cfs are for Target Version/s field)
>>>>
>>>> Critical Issues:
>>>>  SQL:
>>>>   SPARK-19690 <https://issues.apache.org/jira/browse/SPARK-19690> - Join

Re: Outstanding Spark 2.1.1 issues

2017-03-22 Thread Michael Armbrust
An update: I cut the tag for RC1 last night.  Currently fighting with the
release process.  Will post RC1 once I get it working.

On Tue, Mar 21, 2017 at 2:16 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> As for SPARK-19759 <https://issues.apache.org/jira/browse/SPARK-19759>, I
> don't think that needs to be targeted for 2.1.1 so we don't need to worry
> about it
>
>
> On Tue, 21 Mar 2017 at 13:49 Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> I agree with Michael, I think we've got some outstanding issues but none
>> of them seem like regression from 2.1 so we should be good to start the RC
>> process.
>>
>> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust <mich...@databricks.com
>> > wrote:
>>
>> Please speak up if I'm wrong, but none of these seem like critical
>> regressions from 2.1.  As such I'll start the RC process later today.
>>
>> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>> I'm not super sure it should be a blocker for 2.1.1 -- is it a
>> regression? Maybe we can get TDs input on it?
>>
>> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>>
>> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
>> blocker
>>
>> Best,
>>
>> Nan
>>
>> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>> I've been scrubbing R and think we are tracking 2 issues
>>
>> https://issues.apache.org/jira/browse/SPARK-19237
>>
>> https://issues.apache.org/jira/browse/SPARK-19925
>>
>>
>>
>>
>> --
>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
>> Holden Karau <hol...@pigscanfly.ca>
>> *Sent:* Monday, March 20, 2017 3:12:35 PM
>> *To:* dev@spark.apache.org
>> *Subject:* Outstanding Spark 2.1.1 issues
>>
>> Hi Spark Developers!
>>
>> As we start working on the Spark 2.1.1 release I've been looking at our
>> outstanding issues still targeted for it. I've tried to break it down by
>> component so that people in charge of each component can take a quick look
>> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
>> the overall list is pretty short (only 9 items - 5 if we only look at
>> explicitly tagged) :)
>>
>> If you're working on something for Spark 2.1.1 and it doesn't show up in
>> this list please speak up now :) We have a lot of issues (including "in
>> progress") that are listed as impacting 2.1.0, but they aren't targeted for
>> 2.1.1 - if there is something you are working on there which should be
>> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>>
>> The query string I used for looking at the 2.1.1 open issues is:
>>
>> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion =
>> 2.1.1 OR cf[12310320] = "2.1.1") AND project = spark AND resolution =
>> Unresolved ORDER BY priority DESC
>>
>> None of the open issues appear to be a regression from 2.1.0, but those
>> seem more likely to show up during the RC process (thanks in advance to
>> everyone testing their workloads :)) & generally none of them seem to be
>>
>> (Note: the cfs are for Target Version/s field)
>>
>> Critical Issues:
>>  SQL:
>>   SPARK-19690 <https://issues.apache.org/jira/browse/SPARK-19690> - Join
>> a streaming DataFrame with a batch DataFrame may not work - PR
>> https://github.com/apache/spark/pull/17052 (review in progress by
>> zsxwing, currently failing Jenkins)*
>>
>> Major Issues:
>>  SQL:
>>   SPARK-19035 <https://issues.apache.org/jira/browse/SPARK-19035> - rand()
>> function in case when cause failed - no outstanding PR (consensus on JIRA
>> seems to be leaning towards it being a real issue but not necessarily
>> everyone agrees just yet - maybe we should slip this?)*
>>  Deploy:
>>   SPARK-19522 <https://issues.apache.org/jira/browse/SPARK-19522>
>>  - --executor-memory flag doesn't work in local-cluster mode -
>> https://github.com/apache/spark/pull/16975 (review in progress by
>> vanzin, but PR currently stalled waiting on response) *
>>  Core:
>>   SPARK-20025 <https://issues.apache.org/jira/browse/SPARK-20025> - Driver
>> fail over will not work, if SPARK_LOCAL* env is set. -
>> https://github.com/apache/spark/pull/17357 (waiting on review) *
>>  PySpark:
>>  SPARK-19955 <https://issu

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Michael Armbrust
Please speak up if I'm wrong, but none of these seem like critical
regressions from 2.1.  As such I'll start the RC process later today.

On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau  wrote:

> I'm not super sure it should be a blocker for 2.1.1 -- is it a regression?
> Maybe we can get TDs input on it?
>
> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:
>
>> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
>> blocker
>>
>> Best,
>>
>> Nan
>>
>> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
>> wrote:
>>
>> I've been scrubbing R and think we are tracking 2 issues
>>
>> https://issues.apache.org/jira/browse/SPARK-19237
>>
>> https://issues.apache.org/jira/browse/SPARK-19925
>>
>>
>>
>>
>> --
>> *From:* holden.ka...@gmail.com  on behalf of
>> Holden Karau 
>> *Sent:* Monday, March 20, 2017 3:12:35 PM
>> *To:* dev@spark.apache.org
>> *Subject:* Outstanding Spark 2.1.1 issues
>>
>> Hi Spark Developers!
>>
>> As we start working on the Spark 2.1.1 release I've been looking at our
>> outstanding issues still targeted for it. I've tried to break it down by
>> component so that people in charge of each component can take a quick look
>> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
>> the overall list is pretty short (only 9 items - 5 if we only look at
>> explicitly tagged) :)
>>
>> If you're working on something for Spark 2.1.1 and it doesn't show up in
>> this list please speak up now :) We have a lot of issues (including "in
>> progress") that are listed as impacting 2.1.0, but they aren't targeted for
>> 2.1.1 - if there is something you are working on there which should be
>> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>>
>> The query string I used for looking at the 2.1.1 open issues is:
>>
>> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion =
>> 2.1.1 OR cf[12310320] = "2.1.1") AND project = spark AND resolution =
>> Unresolved ORDER BY priority DESC
>>
>> None of the open issues appear to be a regression from 2.1.0, but those
>> seem more likely to show up during the RC process (thanks in advance to
>> everyone testing their workloads :)) & generally none of them seem to be
>>
>> (Note: the cfs are for Target Version/s field)
>>
>> Critical Issues:
>>  SQL:
>>   SPARK-19690  - Join
>> a streaming DataFrame with a batch DataFrame may not work - PR
>> https://github.com/apache/spark/pull/17052 (review in progress by
>> zsxwing, currently failing Jenkins)*
>>
>> Major Issues:
>>  SQL:
>>   SPARK-19035  - rand()
>> function in case when cause failed - no outstanding PR (consensus on JIRA
>> seems to be leaning towards it being a real issue but not necessarily
>> everyone agrees just yet - maybe we should slip this?)*
>>  Deploy:
>>   SPARK-19522 
>>  - --executor-memory flag doesn't work in local-cluster mode -
>> https://github.com/apache/spark/pull/16975 (review in progress by
>> vanzin, but PR currently stalled waiting on response) *
>>  Core:
>>   SPARK-20025  - Driver
>> fail over will not work, if SPARK_LOCAL* env is set. -
>> https://github.com/apache/spark/pull/17357 (waiting on review) *
>>  PySpark:
>>  SPARK-19955  -
>> Update run-tests to support conda [ Part of Dropping 2.6 support -- which
>> we shouldn't do in a minor release -- but also fixes pip installability
>> tests to run in Jenkins ] - PR failing Jenkins (I need to poke this some
>> more, but it seems like 2.7 support works with some other issues remaining.
>> Maybe slip to 2.2?)
>>
>> Minor issues:
>>  Tests:
>>   SPARK-19612  - Tests
>> failing with timeout - No PR per se, but it seems unrelated to the 2.1.1
>> release. It's not targeted for 2.1.1 but listed as affecting 2.1.1 - I'd
>> consider explicitly targeting this for 2.2?
>>  PySpark:
>>   SPARK-19570  - Allow
>> to disable hive in pyspark shell - https://github.com/apache/
>> spark/pull/16906 PR exists but it's difficult to add automated tests for
>> this (although if SPARK-19955
>>  gets in would make
>> testing this easier) - no reviewers yet. Possible re-target?*
>>  Structured Streaming:
>>   SPARK-19613  - Flaky
>> test: StateStoreRDDSuite.versioning and immutability - It's not targeted
>> for 2.1.1 but listed as affecting 2.1.1 - I'd consider explicitly targeting
>> this for 2.2?
>>  ML:
>>   SPARK-19759 
>>  - ALSModel.predict on Dataframes 

Spark 2.2 Code-freeze - 3/20

2017-03-15 Thread Michael Armbrust
Hey Everyone,

Just a quick announcement that I'm planning to cut the branch for Spark 2.2
this coming Monday (3/20).  Please try and get things merged before then
and also please begin retargeting of any issues that you don't think will
make the release.

Michael


Re: Should we consider a Spark 2.1.1 release?

2017-03-15 Thread Michael Armbrust
Hey Holden,

Thanks for bringing this up!  I think we usually cut patch releases when
there are enough fixes to justify it.  Sometimes just a few weeks after the
release.  I guess if we are at 3 months Spark 2.1.0 was a pretty good
release :)

That said, it is probably time. I was about to start thinking about 2.2 as
well (we are a little past the posted code-freeze deadline), so I'm happy
to push the buttons etc. (this is a very good description if you are
curious). I would
love help watching JIRA, posting the burn down on issues and shepherding in
any critical patches.  Feel free to ping me off-line if you like to
coordinate.

Unless there are any objections, how about we aim for an RC of 2.1.1 on
Monday and I'll also plan to cut branch-2.2 then?  (I'll send a separate
email on this as well).

Michael

On Mon, Mar 13, 2017 at 1:40 PM, Holden Karau  wrote:

> I'd be happy to do the work of coordinating a 2.1.1 release if that's a
> thing a committer can do (I think the release coordinator for the most
> recent Arrow release was a committer and the final publish step took a PMC
> member to upload but other than that I don't remember any issues).
>
> On Mon, Mar 13, 2017 at 1:05 PM Sean Owen  wrote:
>
>> It seems reasonable to me, in that other x.y.1 releases have followed ~2
>> months after the x.y.0 release and it's been about 3 months since 2.1.0.
>>
>> Related: creating releases is tough work, so I feel kind of bad voting
>> for someone else to do that much work. Would it make sense to deputize
>> another release manager to help get out just the maintenance releases? this
>> may in turn mean maintenance branches last longer. Experienced hands can
>> continue to manage new minor and major releases as they require more
>> coordination.
>>
>> I know most of the release process is written down; I know it's also
>> still going to be work to make it 100% documented. Eventually it'll be
>> necessary to make sure it's entirely codified anyway.
>>
>> Not pushing for it myself, just noting I had heard this brought up in
>> side conversations before.
>>
>>
>> On Mon, Mar 13, 2017 at 7:07 PM Holden Karau 
>> wrote:
>>
>> Hi Spark Devs,
>>
>> Spark 2.1 has been out since end of December
>> 
>> and we've got quite a few fixes merged for 2.1.1
>> 
>> .
>>
>> On the Python side one of the things I'd like to see us get out into a
>> patch release is a packaging fix (now merged) before we upload to PyPI &
>> Conda, and we also have the normal batch of fixes like toLocalIterator for
>> large DataFrames in PySpark.
>>
>> I've chatted with Felix & Shivaram who seem to think the R side is
>> looking close to being in good shape for a 2.1.1 release to submit to CRAN
>> (if I've misspoken, my apologies). The two outstanding issues that are being
>> tracked for R are SPARK-18817, SPARK-19237.
>>
>> Looking at the other components quickly it seems like structured
>> streaming could also benefit from a patch release.
>>
>> What do others think - are there any issues people are actively targeting
>> for 2.1.1? Is this too early to be considering a patch release?
>>
>> Cheers,
>>
>> Holden
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>


Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-16 Thread Michael Armbrust
Thanks for your interest in Apache Spark Structured Streaming!

There is nothing secret in that demo, though I did make some configuration
changes in order to get the timing right (gotta have some dramatic effect
:) ).  Also I think the visualizations based on metrics output by the
StreamingQueryListener are still being rolled out, but should be available
everywhere soon.

First, I set two options to make sure that files were read one at a time,
thus allowing us to see incremental results.

spark.readStream
  .option("maxFilesPerTrigger", "1")
  .option("latestFirst", "true")
...

There is more detail on how these options work in this post.
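
Putting those together, a minimal end-to-end read might look roughly like this
(the path and schema below are placeholders, not the demo's actual code):

import org.apache.spark.sql.types._

// file sources need an explicit schema for streaming reads
val schema = new StructType()
  .add("time", TimestampType)
  .add("action", StringType)

val input = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", "1")   // one file per micro-batch => incremental results
  .option("latestFirst", "true")       // start from the newest files for demo effect
  .json("/demo/cloudtrail")            // placeholder input path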

Regarding the continually updating result of a streaming query using
display(df) for streaming DataFrames (i.e. one created with spark.readStream),
that has worked in Databricks since Spark 2.1.  The longer form example we
published requires you to rerun the count to see it change at the end of
the notebook because that is not a streaming query; instead it is a batch
query over data that has been written out by another stream.  I'd like to
add the ability to run a streaming query from data that has been written
out by the FileSink (tracked here: SPARK-19633).

In the demo, I started two different streaming queries:
 - one that reads from json / kafka => writes to parquet
 - one that reads from json / kafka => writes to the memory sink and pushes
   the latest answer to the JS running in a browser using the
   StreamingQueryListener.
This is packaged up nicely in display(), but there is nothing stopping you
from building something similar with vanilla Apache Spark.
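
A rough sketch of those two queries with vanilla APIs, reusing the input
stream from the sketch above (sink paths, the query name, and the aggregation
are illustrative assumptions, not the demo's actual code):

import spark.implicits._

// 1) raw events => parquet files, tracked by the file sink's _spark_metadata log
val toParquet = input.writeStream
  .format("parquet")
  .option("path", "/demo/out")
  .option("checkpointLocation", "/demo/chk-parquet")
  .start()

// 2) aggregated counts => in-memory table that a dashboard or JS client can poll
val toMemory = input.groupBy($"action").count()
  .writeStream
  .format("memory")
  .queryName("counts")            // exposed as a temp view named "counts"
  .outputMode("complete")
  .start()

spark.sql("SELECT * FROM counts").show()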

Michael


On Wed, Feb 15, 2017 at 11:34 AM, Sam Elamin 
wrote:

> Hey folks
>
> This one is mainly aimed at the Databricks folks. I have been trying to
> replicate the CloudTrail demo Michael did at Spark Summit. The code for it
> can be found here.
>
> My question is: how did you get the results to be displayed and updated
> continuously in real time?
>
> I am also using Databricks to duplicate it, but I noticed the code link
> mentions:
>
>  "If you count the number of rows in the table, you should find the value
> increasing over time. Run the following every few minutes."
> This leads me to believe that the version of Databricks that Michael was
> using for the demo is still not released, or at least the functionality to
> display those changes in real time isn't.
>
> Is this the case? or am I completely wrong?
>
> Can I display the results of a structured streaming query in realtime
> using the databricks "display" function?
>
>
> Regards
> Sam
>


Re: benefits of code gen

2017-02-10 Thread Michael Armbrust
Function1 is specialized, but nullSafeEval is Any => Any, so that's still
going to box in the non-codegened execution path.
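
A toy illustration of the difference (the names here are made up, not the
actual Spark classes under discussion):

val f: Double => Double = math.sin          // specialized apply, no boxing
val fast: Double = f(0.5)

val g: Any => Any = (v: Any) => math.sin(v.asInstanceOf[Double])
val slow = g(0.5)                           // 0.5 is boxed to java.lang.Double, and the result is boxed again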

On Fri, Feb 10, 2017 at 1:32 PM, Koert Kuipers  wrote:

> based on that i take it that math functions would be primary beneficiaries
> since they work on primitives.
>
> so if i take UnaryMathExpression as an example, would i not get the same
> benefit if i change it to this?
>
> abstract class UnaryMathExpression(val f: Double => Double, name: String)
>   extends UnaryExpression with Serializable with ImplicitCastInputTypes {
>
>   override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
>   override def dataType: DataType = DoubleType
>   override def nullable: Boolean = true
>   override def toString: String = s"$name($child)"
>   override def prettyName: String = name
>
>   protected override def nullSafeEval(input: Any): Any = {
> f(input.asInstanceOf[Double])
>   }
>
>   // name of function in java.lang.Math
>   def funcName: String = name.toLowerCase
>
>   def function(d: Double): Double = f(d)
>
>   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
> val self = ctx.addReferenceObj(name, this, getClass.getName)
> defineCodeGen(ctx, ev, c => s"$self.function($c)")
>   }
> }
>
> admittedly in this case the benefit in terms of removing complex codegen
> is not there (the codegen was only one line), but if i can remove codegen
> here i could also remove it in lots of other places where it does get very
> unwieldy simply by replacing it with calls to methods.
>
> Function1 is specialized, so i think (or hope) that my version does no
> extra boxes/unboxing.
>
> On Fri, Feb 10, 2017 at 2:24 PM, Reynold Xin  wrote:
>
>> With complex types it doesn't work as well, but for primitive types the
>> biggest benefit of whole stage codegen is that we don't even need to put
>> the intermediate data into rows or columns anymore. They are just variables
>> (stored in CPU registers).
>>
>> On Fri, Feb 10, 2017 at 8:22 PM, Koert Kuipers  wrote:
>>
>>> so i have been looking for a while now at all the catalyst expressions,
>>> and all the relative complex codegen going on.
>>>
>>> so first off i get the benefit of codegen to turn a bunch of chained
>>> iterators transformations into a single codegen stage for spark. that makes
>>> sense to me, because it avoids a bunch of overhead.
>>>
>>> but what i am not so sure about is what the benefit is of converting the
>>> actual stuff that happens inside the iterator transformations into codegen.
>>>
>>> say if we have an expression that has 2 children and creates a struct
>>> for them. why would this be faster in codegen by re-creating the code to do
>>> this in a string (which is complex and error prone) compared to simply have
>>> the codegen call the normal method for this in my class?
>>>
>>> i see so much trivial code be re-created in codegen. stuff like this:
>>>
>>>   private[this] def castToDateCode(
>>>   from: DataType,
>>>   ctx: CodegenContext): CastFunction = from match {
>>> case StringType =>
>>>   val intOpt = ctx.freshName("intOpt")
>>>   (c, evPrim, evNull) => s"""
>>> scala.Option $intOpt =
>>>   org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToDat
>>> e($c);
>>> if ($intOpt.isDefined()) {
>>>   $evPrim = ((Integer) $intOpt.get()).intValue();
>>> } else {
>>>   $evNull = true;
>>> }
>>>"""
>>>
>>> is this really faster than simply calling an equivalent functions from
>>> the codegen, and keeping the codegen logic restricted to the "unrolling" of
>>> chained iterators?
>>>
>>>
>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
The JSON log is only used by the file sink (which it doesn't seem like you
are using).  Though, I'm not sure exactly what is going on inside of
setupGoogle or how tableReferenceSource is used.

Typically you would run df.writeStream.option("path", "/my/path")... and
then the transaction log would go into /my/path/_spark_metadata.
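
For example, something along these lines (paths are placeholders), after which
the commit log lives under the output path:

val query = df.writeStream
  .format("parquet")
  .option("path", "/my/path")               // data files land here
  .option("checkpointLocation", "/my/chk")  // offsets and progress for the query
  .start()
// the file sink's log ends up in /my/path/_spark_metadata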

There is no requirement that a sink uses this kind of update log.  This
is just how we get better transactional semantics than HDFS is providing.
If your sink supports transactions natively you should just use those
instead.  We pass a unique ID to the sink method addBatch so that you can
make sure you don't commit the same transaction more than once.
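
A rough sketch of using that ID for idempotence in a custom sink (Sink is an
internal API, so treat the exact trait shape as an assumption for your Spark
version; the two helpers are placeholders for whatever your target system
provides):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class IdempotentSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (alreadyCommitted(batchId)) return   // replayed batch after a restart: skip it
    commitAtomically(batchId, data)         // write the data and record batchId together
  }
  private def alreadyCommitted(batchId: Long): Boolean = ???          // e.g. look up batchId in the target system
  private def commitAtomically(batchId: Long, data: DataFrame): Unit = ???
}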

On Tue, Feb 7, 2017 at 3:29 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Hi Micheal
>
> If thats the case for the below example, where should i be reading these
> json log files first? im assuming sometime between df and query?
>
>
> val df = spark
> .readStream
> .option("tableReferenceSource",tableName)
> .load()
> setUpGoogle(spark.sqlContext)
>
> val query = df
>   .writeStream
>   .option("tableReferenceSink",tableName2)
>   .option("checkpointLocation","checkpoint")
>   .start()
>
>
> On Tue, Feb 7, 2017 at 7:24 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Read the JSON log of files that is in `/your/path/_spark_metadata` and
>> only read files that are present in that log (ignore anything else).
>>
>> On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>>> Ah I see ok so probably it's the retry that's causing it
>>>
>>> So when you say I'll have to take this into account, how do I best do
>>> that? My sink will have to know what was that extra file. And i was under
>>> the impression spark would automagically know this because of the
>>> checkpoint directory set when you created the writestream
>>>
>>> If that's not the case then how would I go about ensuring no duplicates?
>>>
>>>
>>> Thanks again for the awesome support!
>>>
>>> Regards
>>> Sam
>>> On Tue, 7 Feb 2017 at 18:05, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>> Sorry, I think I was a little unclear.  There are two things at play
>>>> here.
>>>>
>>>>  - Exactly-once semantics with file output: spark writes out extra
>>>> metadata on which files are valid to ensure that failures don't cause us to
>>>> "double count" any of the input.  Spark 2.0+ detects this info
>>>> automatically when you use dataframe reader (spark.read...). There may be
>>>> extra files, but they will be ignored. If you are consuming the output with
>>>> another system you'll have to take this into account.
>>>>  - Retries: right now we always retry the last batch when restarting.
>>>> This is safe/correct because of the above, but we could also optimize this
>>>> away by tracking more information about batch progress.
>>>>
>>>> On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>> Hmm ok I understand that but the job is running for a good few mins
>>>> before I kill it so there should not be any jobs left because I can see in
>>>> the log that its now polling for new changes, the latest offset is the
>>>> right one
>>>>
>>>> After I kill it and relaunch it picks up that same file?
>>>>
>>>>
>>>> Sorry if I misunderstood you
>>>>
>>>> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>> It is always possible that there will be extra jobs from failed
>>>> batches. However, for the file sink, only one set of files will make it
>>>> into _spark_metadata directory log.  This is how we get atomic commits even
>>>> when there are files in more than one directory.  When reading the files
>>>> with Spark, we'll detect this directory and use it instead of listStatus to
>>>> find the list of valid files.
>>>>
>>>> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>> On another note, when it comes to checkpointing on structured streaming
>>>>
>>>> I noticed if I have  a stream running off s3 and I kill the process.
>>>> The nex

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Read the JSON log of files that is in `/your/path/_spark_metadata` and only
read files that are present in that log (ignore anything else).
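
A very rough sketch of doing that from Spark itself (the log format is
internal and version-specific, so treat the field name below as an assumption
and check the files on disk before relying on it):

import spark.implicits._

// batch-read the sink's commit log; non-JSON header lines surface as null paths
val committedFiles = spark.read.json("/your/path/_spark_metadata")
  .select("path")
  .na.drop()
  .as[String]
  .collect()
  .toSet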

On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Ah I see ok so probably it's the retry that's causing it
>
> So when you say I'll have to take this into account, how do I best do
> that? My sink will have to know what was that extra file. And i was under
> the impression spark would automagically know this because of the
> checkpoint directory set when you created the writestream
>
> If that's not the case then how would I go about ensuring no duplicates?
>
>
> Thanks again for the awesome support!
>
> Regards
> Sam
> On Tue, 7 Feb 2017 at 18:05, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Sorry, I think I was a little unclear.  There are two things at play here.
>>
>>  - Exactly-once semantics with file output: spark writes out extra
>> metadata on which files are valid to ensure that failures don't cause us to
>> "double count" any of the input.  Spark 2.0+ detects this info
>> automatically when you use dataframe reader (spark.read...). There may be
>> extra files, but they will be ignored. If you are consuming the output with
>> another system you'll have to take this into account.
>>  - Retries: right now we always retry the last batch when restarting.
>> This is safe/correct because of the above, but we could also optimize this
>> away by tracking more information about batch progress.
>>
>> On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>> Hmm ok I understand that but the job is running for a good few mins
>> before I kill it so there should not be any jobs left because I can see in
>> the log that its now polling for new changes, the latest offset is the
>> right one
>>
>> After I kill it and relaunch it picks up that same file?
>>
>>
>> Sorry if I misunderstood you
>>
>> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>> It is always possible that there will be extra jobs from failed batches.
>> However, for the file sink, only one set of files will make it into
>> _spark_metadata directory log.  This is how we get atomic commits even when
>> there are files in more than one directory.  When reading the files with
>> Spark, we'll detect this directory and use it instead of listStatus to find
>> the list of valid files.
>>
>> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>> On another note, when it comes to checkpointing on structured streaming
>>
>> I noticed if I have a stream running off S3 and I kill the process. The
>> next time the process starts running, it duplicates the last record
>> inserted. Is that normal?
>>
>>
>>
>>
>> So say I have streaming enabled on one folder "test" which only has two
>> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
>> When I rerun the stream it picks up "update 2" again
>>
>> Is this normal? Isn't Ctrl+C a failure?
>>
>> I would expect checkpointing to know that update 2 was already processed
>>
>> Regards
>> Sam
>>
>> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>> Thanks Micheal!
>>
>>
>>
>> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>
>> We should add this soon.
>>
>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>> Hi All
>>
>> When trying to read a stream off S3 and I try and drop duplicates I get
>> the following error:
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> Append output mode not supported when there are streaming aggregations on
>> streaming DataFrames/DataSets;;
>>
>>
>> What's strange is that if I use the batch "spark.read.json", it works.
>>
>> Can I assume you can't drop duplicates in structured streaming?
>>
>> Regards
>> Sam
>>
>>
>>
>>
>>
>>
>>
>>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Sorry, I think I was a little unclear.  There are two things at play here.

 - Exactly-once semantics with file output: spark writes out extra metadata
on which files are valid to ensure that failures don't cause us to "double
count" any of the input.  Spark 2.0+ detects this info automatically when
you use dataframe reader (spark.read...). There may be extra files, but
they will be ignored. If you are consuming the output with another system
you'll have to take this into account.
 - Retries: right now we always retry the last batch when restarting.  This
is safe/correct because of the above, but we could also optimize this away
by tracking more information about batch progress.

On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Hmm ok I understand that but the job is running for a good few mins before
> I kill it so there should not be any jobs left because I can see in the log
> that its now polling for new changes, the latest offset is the right one
>
> After I kill it and relaunch it picks up that same file?
>
>
> Sorry if I misunderstood you
>
> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> It is always possible that there will be extra jobs from failed batches.
>> However, for the file sink, only one set of files will make it into
>> _spark_metadata directory log.  This is how we get atomic commits even when
>> there are files in more than one directory.  When reading the files with
>> Spark, we'll detect this directory and use it instead of listStatus to find
>> the list of valid files.
>>
>> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>>> On another note, when it comes to checkpointing on structured streaming
>>>
>>> I noticed if I have a stream running off S3 and I kill the process. The
>>> next time the process starts running, it duplicates the last record
>>> inserted. Is that normal?
>>>
>>>
>>>
>>>
>>> So say I have streaming enabled on one folder "test" which only has two
>>> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
>>> When I rerun the stream it picks up "update 2" again
>>>
>>> Is this normal? Isn't Ctrl+C a failure?
>>>
>>> I would expect checkpointing to know that update 2 was already processed
>>>
>>> Regards
>>> Sam
>>>
>>> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Micheal!
>>>>
>>>>
>>>>
>>>> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>>>>
>>>>> We should add this soon.
>>>>>
>>>>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> When trying to read a stream off S3 and I try and drop duplicates I
>>>>>> get the following error:
>>>>>>
>>>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>>>> Append output mode not supported when there are streaming aggregations on
>>>>>> streaming DataFrames/DataSets;;
>>>>>>
>>>>>>
>>>>>> What's strange is that if I use the batch "spark.read.json", it works.
>>>>>>
>>>>>> Can I assume you can't drop duplicates in structured streaming?
>>>>>>
>>>>>> Regards
>>>>>> Sam
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
It is always possible that there will be extra jobs from failed batches.
However, for the file sink, only one set of files will make it into
_spark_metadata directory log.  This is how we get atomic commits even when
there are files in more than one directory.  When reading the files with
Spark, we'll detect this directory and use it instead of listStatus to find
the list of valid files.

On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> On another note, when it comes to checkpointing on structured streaming
>
> I noticed if I have a stream running off S3 and I kill the process. The
> next time the process starts running, it duplicates the last record
> inserted. Is that normal?
>
>
>
>
> So say I have streaming enabled on one folder "test" which only has two
> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
> When I rerun the stream it picks up "update 2" again
>
> Is this normal? Isn't Ctrl+C a failure?
>
> I would expect checkpointing to know that update 2 was already processed
>
> Regards
> Sam
>
> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
>> Thanks Micheal!
>>
>>
>>
>> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>>
>>> We should add this soon.
>>>
>>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>>> Hi All
>>>>
>>>> When trying to read a stream off S3 and I try and drop duplicates I get
>>>> the following error:
>>>>
>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>> Append output mode not supported when there are streaming aggregations on
>>>> streaming DataFrames/DataSets;;
>>>>
>>>>
>>>> What's strange is that if I use the batch "spark.read.json", it works.
>>>>
>>>> Can I assume you can't drop duplicates in structured streaming?
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>
>>>
>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497

We should add this soon.

On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin  wrote:

> Hi All
>
> When trying to read a stream off S3 and I try and drop duplicates I get
> the following error:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Append
> output mode not supported when there are streaming aggregations on
> streaming DataFrames/DataSets;;
>
>
> What's strange is that if I use the batch "spark.read.json", it works.
>
> Can I assume you can't drop duplicates in structured streaming?
>
> Regards
> Sam
>


Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
-dev

You can use withColumn to change the type after the data has been loaded.
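
A minimal sketch of that approach, reusing the sample record from the thread
below (the column name and target type come from Sam's example):

import org.apache.spark.sql.functions.col

val df = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
val fixed = df.withColumn("customerid", col("customerid").cast("double"))
fixed.printSchema()   // customerid is now a double
fixed.show(1)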

On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin  wrote:

> Hi Dirceu
>
> Thanks, you're right! That did work.
>
>
> But now I'm facing an even bigger problem since I don't have access to
> change the underlying data. I just want to apply a schema over something
> that was written via sparkContext.newAPIHadoopRDD.
>
> Basically I am reading in an RDD[JsonObject] and would like to convert it
> into a DataFrame to which I pass the schema.
>
> Whats the best way to do this?
>
> I doubt removing all the quotes in the JSON is the best solution, is it?
>
> Regards
> Sam
>
> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi Sam
>> Remove the " from the number that it will work
>>
> On Feb 4, 2017 at 11:46 AM, "Sam Elamin" 
> wrote:
>>
>>> Hi All
>>>
>>> I would like to specify a schema when reading from a json but when
>>> trying to map a number to a Double it fails, I tried FloatType and IntType
>>> with no joy!
>>>
>>>
>>> When inferring the schema customer id is set to String, and I would like
>>> to cast it as Double
>>>
>>> so df1 is corrupted while df2 shows
>>>
>>>
>>> Also FYI I need this to be generic as I would like to apply it to any
>>> json, I specified the below schema as an example of the issue I am facing
>>>
>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, 
>>> DoubleType,FloatType, StructType, LongType,DecimalType}
>>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>>> val df1 = 
>>> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> val df2 = 
>>> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> df1.show(1)
>>> df2.show(1)
>>>
>>>
>>> Any help would be appreciated, I am sure I am missing something obvious
>>> but for the life of me I can't tell what it is!
>>>
>>>
>>> Kind Regards
>>> Sam
>>>
>>
>
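
For the RDD[JsonObject] question above, one possible approach (a sketch, not
from the thread; it assumes Gson's JsonObject and a spark-shell session with
spark.implicits._ in scope): serialize each object back to a JSON string, let
Spark infer string types, and then cast with withColumn as suggested above.

import com.google.gson.JsonObject
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DoubleType

def jsonObjectsToDataFrame(rdd: RDD[JsonObject]): DataFrame = {
  val jsonStrings = rdd.map(_.toString)   // JsonObject -> raw JSON text
  spark.read.json(jsonStrings)            // inferred schema: quoted numbers stay strings
}

// Hypothetical usage: apply the desired types afterwards rather than at read
// time, since a DoubleType schema over "535137" (a quoted number) yields null.
// val df = jsonObjectsToDataFrame(jsonObjectRdd)
//   .withColumn("customerid", $"customerid".cast(DoubleType))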


Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Michael Armbrust
+1, we should just fix the error to explain why months aren't allowed and
suggest that you manually specify some number of days.
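
A tiny sketch of that workaround (illustrative names, assuming a DataFrame
"events" with a timestamp column "ts" and spark.implicits._ in scope):

import org.apache.spark.sql.functions.window

// "1 month" is rejected because month length varies; approximate it with a
// fixed number of days instead.
val monthlyish = events
  .groupBy(window($"ts", "30 days"))
  .count()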

On Wed, Jan 18, 2017 at 9:52 AM, Maciej Szymkiewicz 
wrote:

> Thanks for the response Burak,
>
> As any sane person, I try to steer away from objects which have both
> "calendar" and "unsafe" in their fully qualified names, but if there is no
> bigger picture I missed here I would go with 1 as well, and of course fix
> the error message. I understand this has been introduced with structured
> streaming in mind, but it is a useful feature in general, not only at high
> precision. To be honest, I would love to see some generalized version
> which could be used (I mean without hacking) with an arbitrary numeric
> sequence. It could address at least some scenarios in which people try to
> use window functions without a PARTITION BY clause and fail miserably.
>
> Regarding ambiguity... Sticking with days doesn't really resolve the
> problem, does it? If one were to nitpick, it doesn't look like this
> implementation even touches all the subtleties of DST or leap seconds.
>
>
>
>
> On 01/18/2017 05:52 PM, Burak Yavuz wrote:
>
> Hi Maciej,
>
> I believe it would be useful to either fix the documentation or fix the
> implementation. I'll leave it to the community to comment on. The code
> right now disallows intervals provided in months and years, because they
> are not a "consistently" fixed amount of time. A month can be 28, 29, 30,
> or 31 days. A year is 12 months for sure, but is it 360 days (sometimes
> used in finance), 365 days or 366 days?
>
> Therefore we could either:
>   1) Allow windowing when intervals are given in days and less, even
> though it could be 365 days, and fix the documentation.
>   2) Explicitly disallow it as there may be a lot of data for a given
> window, but partial aggregations should help with that.
>
> My thoughts are to go with 1. What do you think?
>
> Best,
> Burak
>
> On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Hi,
>>
>> Can I ask for some clarifications regarding intended behavior of window /
>> TimeWindow?
>>
>> PySpark documentation states that "Windows in the order of months are not
>> supported". This is further confirmed by the checks in
>> TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
>>
>> Surprisingly enough, we can pass an interval much larger than a month by
>> expressing the interval in days or another unit of higher precision. So this
>> fails:
>>
>> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>>
>> while following is accepted:
>>
>> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>>
>> with results which look sensible at first glance.
>>
>> Is it a matter of faulty validation logic (months will be assigned only
>> if there is a match against years or months, https://git.io/vMPdi) or the
>> expected behavior, and I simply misunderstood the intentions?
>>
>> --
>> Best,
>> Maciej
>>
>>
>
>


Re: StateStoreSaveExec / StateStoreRestoreExec

2017-01-03 Thread Michael Armbrust
You might also be interested in this:
https://issues.apache.org/jira/browse/SPARK-19031
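
The mapWithState-style functionality discussed in SPARK-19067 later surfaced
as (flat)mapGroupsWithState. A rough sketch of the row-versioning idea on top
of that API, offered only as an illustration (every class, column, and
variable name here is made up; rowVersions stands for a streaming
Dataset[RowVersion] produced by the custom Source, and spark.implicits._ is
assumed to be in scope):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class RowVersion(key: String, offset: Long, payload: String)
case class Expiration(key: String, expiredOffset: Long, replacedBy: Long)

val expirations = rowVersions
  .groupByKey(_.key)
  .flatMapGroupsWithState[Long, Expiration](
      OutputMode.Append(), GroupStateTimeout.NoTimeout()) {
    (key: String, versions: Iterator[RowVersion], state: GroupState[Long]) =>
      // Previously tracked offset for this key, followed by the new ones.
      val offsets = state.getOption.toSeq ++
        versions.toSeq.sortBy(_.offset).map(_.offset)
      // Each newer offset expires the version tracked just before it.
      val expired = offsets.sliding(2).collect {
        case Seq(old, newer) => Expiration(key, old, newer)
      }
      offsets.lastOption.foreach(state.update)
      expired
  }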

On Tue, Jan 3, 2017 at 3:36 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> I think we should add something similar to mapWithState in 2.2.  It would
> be great if you could add the description of your problem to this ticket:
> https://issues.apache.org/jira/browse/SPARK-19067
>
> On Mon, Jan 2, 2017 at 2:05 PM, Jeremy Smith <jeremy.sm...@acorns.com>
> wrote:
>
>> I have a question about state tracking in Structured Streaming.
>>
>> First let me briefly explain my use case: Given a mutable data source
>> (i.e. an RDBMS) in which we assume we can retrieve a set of newly created
>> row versions (being a row that was created or updated between two given
>> `Offset`s, whatever those are), we can create a Structured Streaming
>> `Source` which retrieves the new row versions. Further assuming that every
>> logical row has some primary key, then as long as we can track the current
>> offset for each primary key, we can differentiate between new and updated
>> rows. Then, when a row is updated, we can record that the previous version
>> of that row expired at some particular time. That's essentially what I'm
>> trying to do. This would effectively give you an "event-sourcing" type of
>> historical/immutable log of changes out of a mutable data source.
>>
>> I noticed that in Spark 2.0.1 there was a concept of a StateStore, which
>> seemed like it would allow me to do exactly the tracking that I needed, so
>> I decided to try and use that built-in functionality rather than some
>> external key/value store for storing the current "version number" of each
>> primary key. There were a lot of hard-coded hoops I had to jump through,
>> but I eventually made it work by implementing some custom LogicalPlans and
>> SparkPlans around StateStore[Save/Restore]Exec.
>>
>> Now, in Spark 2.1.0 it seems to have gotten even further away from what I
>> was using it for - the keyExpressions of StateStoreSaveExec must include a
>> timestamp column, which means that those expressions are not really keys
>> (at least not for a logical row). So it appears I can't use it that way
>> anymore (I can't blame Spark for this, as I knew what I was getting into
>> when leveraging developer APIs). There are also several hard-coded checks
>> which now make it clear that StateStore functionality is only to be used
>> for streaming aggregates, which is not really what I'm doing.
>>
>> My question is - is there a good way to accomplish the above use case
>> within Structured Streaming? Or is this the wrong use case for the state
>> tracking functionality (which increasingly seems to be targeted toward
>> aggregates only)? Is there a plan for any kind of generalized
>> `mapWithState`-type functionality for Structured Streaming, or should I
>> just give up on that and use an external key/value store for my state
>> tracking?
>>
>> Thanks,
>> Jeremy
>>
>
>


Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread Michael Armbrust
An encoder uses reflection to generate expressions that can extract data out
of an object (by calling methods on the object) and encode its contents
directly into the Tungsten binary row format (and vice versa).  We
code-generate bytecode that evaluates these expressions in the same way that
we generate code for normal expression evaluation in query processing.
However, this reflection only works for simple ADTs.  Another key thing to
realize is that we do this reflection / code generation at runtime, so we
aren't constrained by binary compatibility across versions of Spark.

UDTs let you write custom code that translates an object into a
generic row, which we can then translate into Spark's internal format
(using a RowEncoder). Unlike expressions and tungsten binary encoding, the
Row type that you return here is a stable public API that hasn't changed
since Spark 1.3.

So to summarize, if encoders don't work for your specific types you can use
UDTs, but they probably won't be as efficient. I'd love to unify these code
paths more, but it's actually a fair amount of work to come up with a good
stable public API that doesn't sacrifice performance.
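
A small sketch contrasting the two paths described above. Note that
ExpressionEncoder and RowEncoder live in catalyst and are internal, so this
is for illustration only, not a stable API to build against:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

case class Point(x: Double, y: Double)

// Encoder path: reflection over the case class, code-generated straight into
// the Tungsten binary row format.
val pointEncoder = ExpressionEncoder[Point]()

// UDT-style path: produce a generic Row against a schema, and let a
// RowEncoder translate that Row into the internal format.
val pointSchema = StructType(Seq(StructField("x", DoubleType), StructField("y", DoubleType)))
val rowEncoder  = RowEncoder(pointSchema)
val genericRow  = Row(1.0, 2.0)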

On Tue, Dec 27, 2016 at 6:32 AM, dragonly  wrote:

> I've recently been reading the source code of the SparkSQL project, and
> found some interesting Databricks blogs about the Tungsten project. I've
> roughly read through the encoder and unsafe-representation parts of the
> Tungsten project (I haven't read the algorithm parts, such as the
> cache-friendly hashmap algorithms).
> Now there's a big puzzle in front of me about the codegen of SparkSQL and
> how the codegen utilizes the Tungsten encoding between JVM objects and
> unsafe bits.
> So can anyone tell me what's the main difference between situations where I
> write a UDT like ExamplePointUDT in SparkSQL and where I just create an
> ArrayType which can be handled by the Tungsten encoder? I'll really
> appreciate it if you can go through some concrete code examples. Thanks a
> lot!
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/What-is-mainly-
> different-from-a-UDT-and-a-spark-internal-type-that-
> ExpressionEncoder-recognized-tp20370.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: ability to provide custom serializers

2016-12-05 Thread Michael Armbrust
Let's start with a new ticket, link them, and we can merge if the solution
ends up working out for both cases.

On Sun, Dec 4, 2016 at 5:39 PM, Erik LaBianca <erik.labia...@gmail.com>
wrote:

> Thanks Michael!
>
> On Dec 2, 2016, at 7:29 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> I would love to see something like this.  The closest related ticket is
> probably https://issues.apache.org/jira/browse/SPARK-7768 (though maybe
> there are enough people using UDTs in their current form that we should
> just make a new ticket)
>
>
> I’m not very familiar with UDTs. Is this something I should research or
> just leave it be and create a new ticket? I did notice the presence of a
> registry in the source code but it seemed like it was targeted at a
> different use case.
>
> A few thoughts:
>  - even if you can do implicit search, we probably also want a registry
> for Java users.
>
>
> That’s fine. I’m not 100% sure I can get the right implicit in scope as
> things stand anyway, so let’s table that idea for now and do the registry.
>
>  - what is the output of the serializer going to be? one challenge here is
> that encoders write directly into the tungsten format, which is not a
> stable public API. Maybe this is more obvious if I understood MappedColumnType
> better?
>
>
> My assumption was that the output would be existing scalar data types, so
> string, long, double, etc. What I’d like to do is just “layer” the new ones
> on top of already existing ones, kind of like the case class encoder does.
>
> Either way, I'm happy to give further advice if you come up with a more
> concrete proposal and put it on JIRA.
>
>
> Great, let me know and I’ll create a ticket, or we can re-use SPARK-7768
> and we can move the discussion there.
>
> Thanks!
>
> —erik
>
>


Re: ability to provide custom serializers

2016-12-02 Thread Michael Armbrust
I would love to see something like this.  The closest related ticket is
probably https://issues.apache.org/jira/browse/SPARK-7768 (though maybe
there are enough people using UDTs in their current form that we should
just make a new ticket)

A few thoughts:
 - even if you can do implicit search, we probably also want a registry for
Java users.
 - what is the output of the serializer going to be? one challenge here is
that encoders write directly into the tungsten format, which is not a
stable public API. Maybe this is more obvious if I understood MappedColumnType
better?

Either way, I'm happy to give further advice if you come up with a more
concrete proposal and put it on JIRA.
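
For anyone blocked on the java.time.Instant case in the meantime, one
stop-gap (a sketch only, assuming a spark-shell session with spark in scope,
and not the registry/implicit proposal discussed here) is a Kryo-based
encoder, at the cost of collapsing the whole object into a single binary
column:

import java.time.Instant
import org.apache.spark.sql.{Encoder, Encoders}

case class Foo(timestamp: Instant)

// Kryo serializes the entire Foo into one binary column, so you lose
// columnar access, pushdown, and readable schemas -- which is exactly why a
// MappedColumnType-style mapping onto existing scalar types is attractive.
implicit val fooEncoder: Encoder[Foo] = Encoders.kryo[Foo]

val ds = spark.createDataset(Seq(Foo(Instant.now())))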

On Fri, Dec 2, 2016 at 4:03 PM, Erik LaBianca 
wrote:

> Hi All,
>
> Apologies in advance for any confusing terminology, I’m still pretty new
> to Spark.
>
> I’ve got a bunch of Scala case class “domain objects” from an existing
> application. Many of them contain simple, but unsupported-by-spark types in
> them, such as case class Foo(timestamp: java.time.Instant). I’d like to be
> able to use these case classes directly in a DataSet, but can’t, since
> there’s no encoder available for java.time.Instant. I’d like to resolve
> that.
>
> I asked around on the gitter channel, and was pointed to the
> ScalaReflections class, which handles creating Encoder[T] for a variety of
> things, including case classes and their members. Barring a better
> solution, what I’d like is to be able to add some additional case
> statements to the serializerFor and deserializeFor methods, dispatching to
> something along the lines of the Slick MappedColumnType[1]. In an ideal
> scenario, I could provide these mappings via implicit search, but I’d be
> happy to settle for a registry of some sort too.
>
> Does this idea make sense, in general? I’m interested in taking a stab at
> the implementation, but Jakob recommended I surface it here first to see if
> there were any plans around this sort of functionality already.
>
> Thanks!
>
> —erik
>
> 1. http://slick.lightbend.com/doc/3.0.0/userdefined.html#
> using-custom-scalar-types-in-queries
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Flink event session window in Spark

2016-12-02 Thread Michael Armbrust
Here is the JIRA for adding this feature:
https://issues.apache.org/jira/browse/SPARK-10816
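
Until SPARK-10816 lands there is no direct equivalent, but as a point of
reference, here is a sketch of what this looks like with the built-in
session_window function that much later shipped in Spark 3.2 (column names
are illustrative, and spark.implicits._ is assumed to be in scope):

import org.apache.spark.sql.functions.session_window

// Sessions close after 10 minutes of event-time inactivity per key.
val sessionized = events                 // streaming DataFrame with userId, eventTime
  .withWatermark("eventTime", "10 minutes")
  .groupBy($"userId", session_window($"eventTime", "10 minutes"))
  .count()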

On Fri, Dec 2, 2016 at 11:20 AM, Fritz Budiyanto 
wrote:

> Hi All,
>
> I need help on how to implement Flink's event session window in Spark. Is
> this possible?
>
> For instance, I wanted to create a session window with a timeout of 10
> minutes (see the Flink snippet below).
> Continuous events will keep the session window alive. If there is no
> activity for 10 minutes, the session window shall close and forward the
> data to a sink function.
>
> // event-time session windows
> input
>
> .keyBy()
>
> .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
>
> .();
>
>
>
>
> Any idea ?
>
> Thanks,
> Fritz
>


Re: Analyzing and reusing cached Datasets

2016-11-19 Thread Michael Armbrust
You are hitting a weird optimization in withColumn.  Specifically, to avoid
building up huge trees with chained calls to this method, we collapse
projections eagerly (instead of waiting for the optimizer).

Typically we look for cached data in between analysis and optimization, so
that optimizations won't change our ability to recognize cached query
plans.  However, in this case the eager optimization is thwarting that.

On Sat, Nov 19, 2016 at 12:19 PM, Jacek Laskowski  wrote:

> Hi,
>
> There might be a bug in how analyzing Datasets or looking up cached
> Datasets works. I'm on master.
>
> ➜  spark git:(master) ✗ ./bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
>   /_/
>
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_112
> Branch master
> Compiled by user jacek on 2016-11-19T08:39:43Z
> Revision 2a40de408b5eb47edba92f9fe92a42ed1e78bf98
> Url https://github.com/apache/spark.git
> Type --help for more information.
>
> After reviewing CacheManager and how caching works for Datasets I
> thought the following query would use the cached Dataset but it does
> not.
>
> // Cache Dataset -- it is lazy
> scala> val df = spark.range(1).cache
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
>
> // Trigger caching
> scala> df.show
> +---+
> | id|
> +---+
> |  0|
> +---+
>
> // Visit http://localhost:4040/storage to see the Dataset cached. And it
> is.
>
> // Use the cached Dataset in another query
> // Notice InMemoryRelation in use for cached queries
> // It works as expected.
> scala> df.withColumn("newId", 'id).explain(extended = true)
> == Parsed Logical Plan ==
> 'Project [*, 'id AS newId#16]
> +- Range (0, 1, step=1, splits=Some(8))
>
> == Analyzed Logical Plan ==
> id: bigint, newId: bigint
> Project [id#0L, id#0L AS newId#16L]
> +- Range (0, 1, step=1, splits=Some(8))
>
> == Optimized Logical Plan ==
> Project [id#0L, id#0L AS newId#16L]
> +- InMemoryRelation [id#0L], true, 1, StorageLevel(disk, memory,
> deserialized, 1 replicas)
>   +- *Range (0, 1, step=1, splits=Some(8))
>
> == Physical Plan ==
> *Project [id#0L, id#0L AS newId#16L]
> +- InMemoryTableScan [id#0L]
>   +- InMemoryRelation [id#0L], true, 1, StorageLevel(disk,
> memory, deserialized, 1 replicas)
> +- *Range (0, 1, step=1, splits=Some(8))
>
> I hoped that the following query would use the cached one but it does
> not. Should it? I thought that QueryExecution.withCachedData [1] would
> do the trick.
>
> [1] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L70
>
> // The following snippet uses spark.range(1) which is the same as the
> one cached above
> // Why does the physical plan not use InMemoryTableScan and
> InMemoryRelation?
> scala> spark.range(1).withColumn("new", 'id).explain(extended = true)
> == Parsed Logical Plan ==
> 'Project [*, 'id AS new#29]
> +- Range (0, 1, step=1, splits=Some(8))
>
> == Analyzed Logical Plan ==
> id: bigint, new: bigint
> Project [id#26L, id#26L AS new#29L]
> +- Range (0, 1, step=1, splits=Some(8))
>
> == Optimized Logical Plan ==
> Project [id#26L, id#26L AS new#29L]
> +- Range (0, 1, step=1, splits=Some(8))
>
> == Physical Plan ==
> *Project [id#26L, id#26L AS new#29L]
> +- *Range (0, 1, step=1, splits=Some(8))
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Multiple streaming aggregations in structured streaming

2016-11-18 Thread Michael Armbrust
Doing this generally is pretty hard.  We will likely support algebraic
aggregates eventually, but this is not currently slotted for 2.2.  Instead I
think we will add something like mapWithState that lets users compute
arbitrary stateful things.  What is your use case?


On Wed, Nov 16, 2016 at 6:58 PM, wszxyh  wrote:

> Hi
>
> Multiple streaming aggregations are not yet supported. When will they be
> supported? Is it in the plan?
>
> Thanks
>
>
>
>


Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Michael Armbrust
On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon  wrote:

> It sounds like you may be looking for the from_json/to_json functions,
> after en/decoding properly.
>

Those are new built-in functions that will be released with Spark 2.1.
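
A short sketch of the from_json usage this points at (Spark 2.1+; the column
name json_blob, the schema, and df are illustrative, with spark.implicits._
in scope):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val blobSchema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val parsed = df
  .withColumn("data", from_json($"json_blob", blobSchema))
  .select($"data.id", $"data.name")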

