Unsubscribe

2020-02-24 Thread chenzhihan


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Unsubscribe

2020-02-24 Thread Stepan Tuchin
Unsubscribe
-- 


Stepan Tuchin, Automation Quality Engineer

Grid Dynamics

Vavilova, 38/114, Saratov

Dir: +7 (902) 047-55-55


Unsubscribe

2020-02-24 Thread Sharanabasappa G Keriwaddi (Sharan, IT & Cloud BL R)





[Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-24 Thread Michael Armbrust
Hello Everyone,

As more users have started upgrading to Spark 3.0 preview (including
myself), there have been many discussions around APIs that have been broken
compared with Spark 2.x. In many of these discussions, one of the
rationales for breaking an API seems to be "Spark follows semantic
versioning, so this major
release is our chance to get it right [by breaking APIs]". Similarly, in
many cases the response to questions about why an API was completely
removed has been, "this API has been deprecated since x.x, so we have to
remove it".

As a long-time contributor to and user of Spark, I find this interpretation
of the policy concerning. This reasoning misses the intention of the
original policy, and I am worried that it will hurt the long-term success
of the project.

I definitely understand that these are hard decisions, and I'm not
proposing that we never remove anything from Spark. However, I would like
to give some additional context and also propose a different rubric for
thinking about API breakage moving forward.

Spark adopted semantic versioning back in 2014 during the preparations for
the 1.0 release. As this was the first major release -- and as, up until
fairly recently, Spark had only been an academic project -- no real
promises about API stability had ever been made.

During the discussion, some committers suggested that this was an
opportunity to clean up cruft and give the Spark APIs a once-over, making
cosmetic changes to improve consistency. However, in the end, it was
decided that in many cases it was not in the best interests of the Spark
community to break things just because we could. Matei actually said it
pretty forcefully:

I know that some names are suboptimal, but I absolutely detest breaking
APIs, config names, etc. I’ve seen it happen way too often in other
projects (even things we depend on that are officially post-1.0, like Akka
or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
cutting-edge users are okay with libraries occasionally changing, but many
others will consider it a show-stopper. Given this, I think that any
cosmetic change now, even though it might improve clarity slightly, is not
worth the tradeoff in terms of creating an update barrier for existing
users.

In the end, while some changes were made, most APIs remained the same and
users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think
this served the project very well, as compatibility means users are able to
upgrade and we keep as many people on the latest versions of Spark (though
maybe not the latest APIs of Spark) as possible.

As Spark grows, I think compatibility actually becomes more important and
we should be more conservative rather than less. Today, there are very
likely more Spark programs running than there were at any other time in the
past. Spark is no longer a tool used only by advanced hackers; it is now
also running "traditional enterprise workloads." In many cases these jobs
are powering important processes long after the original author leaves.

Broken APIs can also affect libraries that extend Spark. This can be even
harder for users: if a library they depend on has not been upgraded to use
the new APIs, they are stuck.

Given all of this, I'd like to propose the following rubric as an addition
to our semantic versioning policy. After discussion and if people agree
this is a good idea, I'll call a vote of the PMC to ratify its inclusion in
the official policy.

Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing
behavior, even at major versions. While this is not always possible, the
balance of the following factors should be considered before choosing to
break an API.

Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark.
A broken API means that Spark programs need to be rewritten before they can
be upgraded. However, there are a few considerations when thinking about
what the cost will be:

   - Usage - an API that is actively used in many different places is always
     very costly to break. While it is hard to know usage for sure, there are
     a bunch of ways that we can estimate:
      - How long has the API been in Spark?
      - Is the API common even for basic programs?
      - How often do we see recent questions in JIRA or mailing lists?
      - How often does it appear in StackOverflow or blogs?
   - Behavior after the break - How will a program that works today work
     after the break? The following are listed roughly in order of increasing
     severity:
      - Will there be a compiler or linker error?
      - Will there be a runtime exception?
      - Will that exception happen after significant 

Re: https://spark-project.atlassian.net/browse/SPARK-1153

2020-02-24 Thread kant kodali
Sorry, please ignore this. I accidentally ran it with GraphX instead of
GraphFrames.

I see the code here
https://github.com/graphframes/graphframes/blob/a30adaf53dece8c548d96c895ac330ecb3931451/src/main/scala/org/graphframes/GraphFrame.scala#L539-L555
which indeed generates its own id! That's great!
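
For reference, here is a minimal sketch of the same general idea rather than
the GraphFrames code itself: Spark's monotonically_increasing_id can assign
collision-free long ids to string-keyed vertices without any cross-executor
synchronization, because it packs the partition id into the upper bits and a
per-partition counter into the lower bits. Apart from
monotonically_increasing_id, the column and method names below are
illustrative.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.monotonically_increasing_id

  // Assign a unique Long id to every vertex row without coordination.
  // monotonically_increasing_id() encodes the partition id in the upper 31
  // bits and the record number within the partition in the lower 33 bits, so
  // the ids are unique across the whole DataFrame, though not consecutive.
  def indexVertices(vertices: DataFrame): DataFrame =
    vertices.withColumn("long_id", monotonically_increasing_id())

  // Edges keyed by the original string/UUID column can then be joined against
  // this (string id -> long_id) mapping to obtain Long-keyed edges.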

Thanks

On Sun, Feb 23, 2020 at 3:53 PM kant kodali  wrote:

> Hi All,
>
> Any chance of fixing this one ?
> https://spark-project.atlassian.net/browse/SPARK-1153 or offer some
> workaround maybe?
>
> Currently, I have a bunch of events streaming into Kafka across various
> topics and they are stamped with a UUIDv1 for each event, so it is easy to
> construct edges using UUIDs. I am not quite sure how to generate a long-based
> unique id without synchronization in a distributed setting. I had read this
> SO post, which shows there are two ways one may be able to achieve this:
>
> 1.  UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE
>
> 2.  (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L)
>
> However, I am concerned about collisions and am looking for the probability
> of collisions for the above two approaches. Any suggestions?
>
> I ran the Connected Components algorithm using GraphFrames; it runs well
> when long-based ids are used, but with strings the performance drops
> significantly, as pointed out in the ticket. I understand that the algorithm
> depends heavily on hashing integers, but I wonder why not use fixed-length
> byte[]? That way we can convert any datatype to a sequence of bytes.
>
> Thanks!
>
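
Regarding the collision question quoted above, a rough back-of-the-envelope
sketch rather than a definitive analysis: approach 1 leaves roughly 59-60
effectively random bits once the UUID version field and the cleared sign bit
are accounted for, so the usual birthday approximation applies. Approach 2 is
not uniformly random at all: it collides whenever two events generated in the
same millisecond share the same low ~19 bits of nanoTime (which is what the
mask keeps), and that is much harder to bound.

  // Birthday approximation for n ids drawn (roughly) uniformly at random from
  // a space of 2^bits values: p(collision) ~= 1 - exp(-n*(n-1) / 2^(bits+1)).
  def approxCollisionProbability(n: Double, bits: Int): Double =
    1.0 - math.exp(-n * (n - 1) / (2.0 * math.pow(2.0, bits)))

  // e.g. one billion ids:
  approxCollisionProbability(1e9, 63)  // ~0.05 if all 63 masked bits were random
  approxCollisionProbability(1e9, 60)  // ~0.35 with ~60 effectively random bits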