Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread Tathagata Das
1. Yes. All times are in event time, not processing time. So you may get 10AM
event-time data at 11AM processing time, but it will still be compared
against all data within the 9-10AM event-time range.

2. Show us your code.
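
For reference, here is a minimal sketch of the pattern being asked about:
deduplicate with a watermark, then do a windowed count. This is not the
poster's actual job; the built-in "rate" source, the 10-minute window, and the
column names (eventTime, id) are only illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DedupThenWindowCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("dedup-then-window-count")
          .master("local[2]")
          .getOrCreate()
        import spark.implicits._

        // The "rate" source emits (timestamp, value); treat the timestamp as
        // the event time and derive a synthetic id for deduplication.
        val events = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "10")
          .load()
          .select($"timestamp".as("eventTime"), ($"value" % 100).as("id"))

        // The 1-hour watermark bounds the state kept by dropDuplicates: rows
        // whose event time falls more than an hour behind the maximum event
        // time seen so far can be dropped from the deduplication state.
        val deduped = events
          .withWatermark("eventTime", "1 hour")
          .dropDuplicates("id", "eventTime")

        // Windowed count over the deduplicated stream. The second
        // withWatermark call here is the step the question asks about.
        val counts = deduped
          .withWatermark("eventTime", "1 hour")
          .groupBy(window($"eventTime", "10 minutes"))
          .count()

        counts.writeStream
          .outputMode("update")
          .format("console")
          .start()
          .awaitTermination()
      }
    }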

On Thu, Feb 27, 2020 at 2:30 AM lec ssmi  wrote:

> Hi:
> I'm new to structured streaming. Because the built-in API cannot
> perform a windowed count-distinct operation, I want to use
> dropDuplicates first and then perform the windowed count.
> But in the process of using it, I ran into two questions:
> 1. Because this is streaming computation, the deduplication state
> needs to be cleared in time, which requires a watermark. Assuming my
> event-time field is monotonically increasing and I set the watermark to
> 1 hour, does that mean the data at 10 o'clock will only be compared
> against the data from 9 o'clock to 10 o'clock, and the data from before
> 9 o'clock will be cleared?
> 2. Because the deduplication is per window, I set the watermark
> before deduplication to the window size. But after deduplication, I need
> to call withWatermark() again to set the watermark to the real
> watermark. Will setting the watermark again take effect?
>
> Thanks a lot!
>


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Matei Zaharia
+1 on this new rubric. It definitely captures the issues I’ve seen in Spark and 
in other projects. If we write down this rubric (or something like it), it will 
also be easier to refer to it during code reviews or in proposals of new APIs 
(we could ask “do you expect to have to change this API in the future, and if 
so, how”).

Matei

> On Feb 24, 2020, at 3:02 PM, Michael Armbrust  wrote:
> 
> Hello Everyone,
> 
> As more users have started upgrading to Spark 3.0 preview (including myself), 
> there have been many discussions around APIs that have been broken compared 
> with Spark 2.x. In many of these discussions, one of the rationales for 
> breaking an API seems to be "Spark follows semantic versioning, so this major 
> release is our chance to get it right [by breaking APIs]". Similarly, in many cases the 
> response to questions about why an API was completely removed has been, "this 
> API has been deprecated since x.x, so we have to remove it".
> 
> As a long time contributor to and user of Spark this interpretation of the 
> policy is concerning to me. This reasoning misses the intention of the 
> original policy, and I am worried that it will hurt the long-term success of 
> the project.
> 
> I definitely understand that these are hard decisions, and I'm not proposing 
> that we never remove anything from Spark. However, I would like to give some 
> additional context and also propose a different rubric for thinking about API 
> breakage moving forward.
> 
> Spark adopted semantic versioning back in 2014 during the preparations for 
> the 1.0 release. As this was the first major release -- and as, up until 
> fairly recently, Spark had only been an academic project -- no real promises 
> had been made about API stability ever.
> 
> During the discussion, some committers suggested that this was an opportunity 
> to clean up cruft and give the Spark APIs a once-over, making cosmetic 
> changes to improve consistency. However, in the end, it was decided that in 
> many cases it was not in the best interests of the Spark community to break 
> things just because we could. Matei actually said it pretty forcefully:
> 
> I know that some names are suboptimal, but I absolutely detest breaking APIs, 
> config names, etc. I’ve seen it happen way too often in other projects (even 
> things we depend on that are officially post-1.0, like Akka or Protobuf or 
> Hadoop), and it’s very painful. I think that we as fairly cutting-edge users 
> are okay with libraries occasionally changing, but many others will consider 
> it a show-stopper. Given this, I think that any cosmetic change now, even 
> though it might improve clarity slightly, is not worth the tradeoff in terms 
> of creating an update barrier for existing users.
> 
> In the end, while some changes were made, most APIs remained the same and 
> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this 
> served the project very well, as compatibility means users are able to 
> upgrade and we keep as many people on the latest versions of Spark (though 
> maybe not the latest APIs of Spark) as possible.
> 
> As Spark grows, I think compatibility actually becomes more important and we 
> should be more conservative rather than less. Today, there are very likely 
> more Spark programs running than there were at any other time in the past. 
> Spark is no longer a tool only used by advanced hackers; it is now also 
> running "traditional enterprise workloads." In many cases these jobs are 
> powering important processes long after the original author leaves.
> 
> Broken APIs can also affect libraries that extend Spark. This dependency can 
> be even harder for users: if the library has not been upgraded to use new 
> APIs and they need that library, they are stuck.
> 
> Given all of this, I'd like to propose the following rubric as an addition to 
> our semantic versioning policy. After discussion and if people agree this is 
> a good idea, I'll call a vote of the PMC to ratify its inclusion in the 
> official policy.
> 
> Considerations When Breaking APIs
> The Spark project strives to avoid breaking APIs or silently changing 
> behavior, even at major versions. While this is not always possible, the 
> balance of the following factors should be considered before choosing to 
> break an API.
> 
> Cost of Breaking an API
> Breaking an API almost always has a non-trivial cost to the users of Spark. A 
> broken API means that Spark programs need to be rewritten before they can be 
> upgraded. However, there are a few considerations when thinking about what 
> the cost will be:
> Usage - an API that is actively used in many different places is always very 
> costly to break. While it is hard to know usage for sure, there are a bunch 
> of ways that we can estimate: 

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Michael Armbrust
Thanks for the discussion! A few responses:

> The decision needs to happen at API/config change time, otherwise the
> deprecation warnings have no purpose if we are never going to remove them.
>

Even if we never remove an API, I think deprecation warnings (when done
right) can still serve a purpose. For new users, a deprecation can serve as
a pointer to newer, faster APIs or ones with fewer sharp edges. I would be
supportive of efforts that use them to clean up the docs. For example, we
could hide deprecated APIs after some time so they don't clutter the Scala/Java
docs. We can and should audit things like the user guide and our own
examples to make sure they don't use deprecated APIs.
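
To make that concrete, here is a hypothetical sketch (not an actual Spark API)
of how a deprecation can point callers at a replacement without breaking them:

    import org.apache.spark.sql.Dataset

    object LegacyHelpers {
      // Still compiles and still works, but every caller gets a compile-time
      // warning that points at the preferred API.
      @deprecated("Use Dataset.count() directly; this wrapper adds nothing", "3.0.0")
      def countRows(ds: Dataset[_]): Long = ds.count()
    }

Hiding such symbols from the generated docs after a while, as suggested above,
keeps them available without cluttering the reference.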


> That said, we still need to be able to remove deprecated things and change
> APIs in major releases, otherwise why do a major release in the first
> place? Is it purely to support newer Scala/Python/Java versions?
>

I don't think major versions are purely for
Scala/Java/Python/Hive/Metastore upgrades, but they are a good chance to move
the project forward. Spark 3.0 has a lot of upgrades here, and I think we
made the right trade-offs, even though there are some API breaks.

Major versions are also a good time to land major changes (e.g., in 2.0 we
released whole-stage code generation).


> I think the hardest part listed here is assessing what the impact is. Whose
> call is that? It's hard to know how everyone is using things, and I think
> it's been harder to get feedback on SPIPs and API changes in general as
> people are busy with other things.
>

This is the hardest part, and we won't always get it right. I think that
having the rubric though will help guide the conversation and help
reviewers ask the right questions.

One other thing I'll add: sometimes the users come to us, and we should
listen! I was very surprised by the response to Karen's email on this list
last week. An actual user was giving us feedback on the impact of the
changes in Spark 3.0, and rather than listening, there was a lot of pushback.
Users are never wrong when they are telling you what matters to them!


> Like you mention, I think Stack Overflow is unreliable; the posts could be
> many years old and no longer relevant.
>

While this is unfortunate, I think anything we can do to keep these
answers relevant (either by updating them or by not breaking them) is good
for the health of the Spark community.


Re: Clarification on the commit protocol

2020-02-27 Thread Michael Armbrust
No, it is not. The commit protocol has, however, mostly been superseded by Delta
Lake, which is available as a separate open source
project that works natively with Apache Spark. In contrast to the commit
protocol, Delta can guarantee full ACID semantics (rather than just partition-level
atomicity). It also has better performance in many cases, as it reduces the
amount of metadata that needs to be retrieved from the storage system.
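
For anyone comparing the two approaches, here is a minimal sketch of writing and
reading a Delta table from Spark. The path is made up, and it assumes the
io.delta:delta-core package is on the classpath.

    import org.apache.spark.sql.SparkSession

    object DeltaWriteExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("delta-write-example")
          .master("local[2]")
          .getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

        // Each write is recorded in Delta's transaction log, so the table
        // moves atomically from one version to the next instead of relying
        // on the file-commit protocol.
        df.write
          .format("delta")
          .mode("overwrite")
          .save("/tmp/events_delta")

        // Readers see a consistent snapshot; file listing is driven by the
        // log rather than by scanning the storage system.
        spark.read.format("delta").load("/tmp/events_delta").show()
      }
    }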

On Wed, Feb 26, 2020 at 9:36 PM rahul c  wrote:

> Hi team,
>
> Just wanted to understand.
> Is DBIO commit protocol available in open source spark version ?
>


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Tom Graves
In general, +1. I think these are good guidelines, and making it easier to
upgrade is beneficial to everyone. The decision needs to happen at API/config
change time, otherwise the deprecation warnings have no purpose if we are never
going to remove them. That said, we still need to be able to remove deprecated
things and change APIs in major releases, otherwise why do a major release in
the first place? Is it purely to support newer Scala/Python/Java versions?

I think the hardest part listed here is assessing what the impact is. Whose
call is that? It's hard to know how everyone is using things, and I think it's
been harder to get feedback on SPIPs and API changes in general as people are
busy with other things. Like you mention, I think Stack Overflow is unreliable;
the posts could be many years old and no longer relevant.

Tom

On Monday, February 24, 2020, 05:03:44 PM CST, Michael Armbrust 
 wrote:
 
 
Hello Everyone,


As more users have started upgrading to Spark 3.0 preview (including myself), 
there have been many discussions around APIs that have been broken compared 
with Spark 2.x. In many of these discussions, one of the rationales for 
breaking an API seems to be "Spark follows semantic versioning, so this major 
release is our chance to get it right [by breaking APIs]". Similarly, in many 
cases the response to questions about why an API was completely removed has 
been, "this API has been deprecated since x.x, so we have to remove it".


As a long time contributor to and user of Spark this interpretation of the 
policy is concerning to me. This reasoning misses the intention of the original 
policy, and I am worried that it will hurt the long-term success of the project.


I definitely understand that these are hard decisions, and I'm not proposing 
that we never remove anything from Spark. However, I would like to give some 
additional context and also propose a different rubric for thinking about API 
breakage moving forward.


Spark adopted semantic versioning back in 2014 during the preparations for the 
1.0 release. As this was the first major release -- and as, up until fairly 
recently, Spark had only been an academic project -- no real promises had been 
made about API stability ever.


During the discussion, some committers suggested that this was an opportunity 
to clean up cruft and give the Spark APIs a once-over, making cosmetic changes 
to improve consistency. However, in the end, it was decided that in many cases 
it was not in the best interests of the Spark community to break things just 
because we could. Matei actually said it pretty forcefully:


I know that some names are suboptimal, but I absolutely detest breaking APIs, 
config names, etc. I’ve seen it happen way too often in other projects (even 
things we depend on that are officially post-1.0, like Akka or Protobuf or 
Hadoop), and it’s very painful. I think that we as fairly cutting-edge users 
are okay with libraries occasionally changing, but many others will consider it 
a show-stopper. Given this, I think that any cosmetic change now, even though 
it might improve clarity slightly, is not worth the tradeoff in terms of 
creating an update barrier for existing users.


In the end, while some changes were made, most APIs remained the same and users 
of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this served 
the project very well, as compatibility means users are able to upgrade and we 
keep as many people on the latest versions of Spark (though maybe not the 
latest APIs of Spark) as possible.


As Spark grows, I think compatibility actually becomes more important and we 
should be more conservative rather than less. Today, there are very likely more 
Spark programs running than there were at any other time in the past. Spark is 
no longer a tool only used by advanced hackers; it is now also running 
"traditional enterprise workloads." In many cases these jobs are powering 
important processes long after the original author leaves.


Broken APIs can also affect libraries that extend Spark. This dependency can be 
even harder for users: if the library has not been upgraded to use new APIs 
and they need that library, they are stuck.


Given all of this, I'd like to propose the following rubric as an addition to 
our semantic versioning policy. After discussion and if people agree this is a 
good idea, I'll call a vote of the PMC to ratify its inclusion in the official 
policy.


Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing behavior, 
even at major versions. While this is not always possible, the balance of the 
following factors should be considered before choosing to break an API.


Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark. A 
broken API means that Spark programs need to be rewritten before they can be 
upgraded. However, there are a few considerations when thinking about 

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Sean Owen
Those are all quite reasonable guidelines and I'd put them into the
contributing or developer guide, sure.
Although not argued here, I think we should go further than codifying
and enforcing common-sense guidelines like these. I think bias should
shift in favor of retaining APIs going forward, and even retroactively
shift for 3.0 somewhat. (Hence some reverts currently in progress.)
It's a natural evolution from 1.x to 2.x to 3.x. The API surface area
stops expanding and changing and getting fixed as much; years more
experience prove out what APIs make sense.

On Mon, Feb 24, 2020 at 5:03 PM Michael Armbrust  wrote:
>
> Hello Everyone,
>
>
> As more users have started upgrading to Spark 3.0 preview (including myself), 
> there have been many discussions around APIs that have been broken compared 
> with Spark 2.x. In many of these discussions, one of the rationales for 
> breaking an API seems to be "Spark follows semantic versioning, so this major 
> release is our chance to get it right [by breaking APIs]". Similarly, in many 
> cases the response to questions about why an API was completely removed has 
> been, "this API has been deprecated since x.x, so we have to remove it".
>
>
> As a long time contributor to and user of Spark this interpretation of the 
> policy is concerning to me. This reasoning misses the intention of the 
> original policy, and I am worried that it will hurt the long-term success of 
> the project.
>
>
> I definitely understand that these are hard decisions, and I'm not proposing 
> that we never remove anything from Spark. However, I would like to give some 
> additional context and also propose a different rubric for thinking about API 
> breakage moving forward.
>
>
> Spark adopted semantic versioning back in 2014 during the preparations for 
> the 1.0 release. As this was the first major release -- and as, up until 
> fairly recently, Spark had only been an academic project -- no real promises 
> had been made about API stability ever.
>
>
> During the discussion, some committers suggested that this was an opportunity 
> to clean up cruft and give the Spark APIs a once-over, making cosmetic 
> changes to improve consistency. However, in the end, it was decided that in 
> many cases it was not in the best interests of the Spark community to break 
> things just because we could. Matei actually said it pretty forcefully:
>
>
> I know that some names are suboptimal, but I absolutely detest breaking APIs, 
> config names, etc. I’ve seen it happen way too often in other projects (even 
> things we depend on that are officially post-1.0, like Akka or Protobuf or 
> Hadoop), and it’s very painful. I think that we as fairly cutting-edge users 
> are okay with libraries occasionally changing, but many others will consider 
> it a show-stopper. Given this, I think that any cosmetic change now, even 
> though it might improve clarity slightly, is not worth the tradeoff in terms 
> of creating an update barrier for existing users.
>
>
> In the end, while some changes were made, most APIs remained the same and 
> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this 
> served the project very well, as compatibility means users are able to 
> upgrade and we keep as many people on the latest versions of Spark (though 
> maybe not the latest APIs of Spark) as possible.
>
>
> As Spark grows, I think compatibility actually becomes more important and we 
> should be more conservative rather than less. Today, there are very likely 
> more Spark programs running than there were at any other time in the past. 
> Spark is no longer a tool only used by advanced hackers; it is now also 
> running "traditional enterprise workloads." In many cases these jobs are 
> powering important processes long after the original author leaves.
>
>
> Broken APIs can also affect libraries that extend Spark. This dependency can 
> be even harder for users: if the library has not been upgraded to use new 
> APIs and they need that library, they are stuck.
>
>
> Given all of this, I'd like to propose the following rubric as an addition to 
> our semantic versioning policy. After discussion and if people agree this is 
> a good idea, I'll call a vote of the PMC to ratify its inclusion in the 
> official policy.
>
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing 
> behavior, even at major versions. While this is not always possible, the 
> balance of the following factors should be considered before choosing to 
> break an API.
>
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of Spark. A 
> broken API means that Spark programs need to be rewritten before they can be 
> upgraded. However, there are a few considerations when thinking about what 
> the cost will be:
>
> Usage - an API that is actively used in many different places is always very 
> costly to break. While it is hard