On lazy consensus as opposed to voting:

First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?
Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?

rb

On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:
> So there are some minor things (the Where section heading appears to be dropped; wherever this document is posted, it needs to actually link to a JIRA filter showing current / past SIPs), but it doesn't look like I can comment on the Google doc.
>
> The major substantive issue I have is that this version is significantly less clear as to the outcome of an SIP.
>
> The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement and an explicit deadline, which I think are necessary for clarity.
>
> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
> > It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
> >
> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
> >> Oops. Let me try to figure that out.
> >>
> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
> >>> Thanks for picking up on this.
> >>>
> >>> Maybe I fail at Google Docs, but I can't see any edits on the document you linked.
> >>>
> >>> Regarding lazy consensus: if the board in general has less of an issue with that, sure. As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
> >>>
> >>> The other points are hard to comment on without being able to see the text in question.
> >>>
> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
> >>> > I just looked through the entire thread again tonight; there are a lot of great ideas being discussed.
> >>> > Thanks Cody for taking the first crack at the proposal.
> >>> >
> >>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
> >>> >
> >>> > To that end, the two biggest areas for improvement, in my opinion, are:
> >>> >
> >>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people who don't follow closely, it is difficult to know what the important initiatives are. Even for people who do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
> >>> >
> >>> > 2. Soliciting user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
> >>> >
> >>> > I've taken Cody's doc and edited it:
> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> >>> > (I've made all my modifications trackable.)
> >>> >
> >>> > There are a couple of high-level changes I made:
> >>> >
> >>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting, the reason being that in voting there can easily be a "loser" that gets outvoted.
> >>> >
> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch".
> >>> > Echoing one of the earlier emails: "IMHO so far, aside from tagging things and linking them elsewhere, simply having design docs and prototype implementations in PRs is not something that has worked so far".
> >>> >
> >>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
> >>> >
> >>> > While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that too ...
> >>> >
> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
> >>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
> >>> >>
> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
> >>> >>>
> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for an SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
> >>> >>>
> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
> >>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
> >>> >>> >
> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >>> >>> >
> >>> >>> > Or are we going to let this discussion die on the vine?
> >>> >>> >
> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
> >>> >>> >> Maybe my mail was not clear enough.
> >>> >>> >>
> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with the benchmarks was to show two things:
> >>> >>> >>
> >>> >>> >> - why some people are doing bad PR for Spark
> >>> >>> >>
> >>> >>> >> - how, in an easy way, we can change that and show that Spark is still on top
> >>> >>> >>
> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the "Spark vs Hadoop" chart. It is important to show that the framework is not just the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.
> >>> >>> >>
> >>> >>> >> About real-time streaming: I think it would just be good to see it in Spark. I very much like the current Spark model, but there are many voices saying "we need more"; the community should also listen to them and try to help them. With SIPs it would be easier. I've just posted this example as a "thing that may be changed with an SIP".
> >>> >>> >>
> >>> >>> >> I very much like the unification via Datasets, but there are a lot of algorithms inside. Let's make an easy API, but with strong background material (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
> >>> >>> >>
> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, so from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
> >>> >>> >>
> >>> >>> >> Pozdrawiam / Best regards,
> >>> >>> >>
> >>> >>> >> Tomasz
> >>> >>> >>
> >>> >>> >> ________________________________
> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
> >>> >>> >> Sent: October 17, 2016, 16:46
> >>> >>> >> To: Debasish Das
> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
> >>> >>> >> Subject: Re: Spark Improvement Proposals
> >>> >>> >>
> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
> >>> >>> >>
> >>> >>> >> My point is: evolve or die. Spark's governance and organization are hampering its ability to evolve technologically, and that needs to change.
> >>> >>> >>
> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> >>> >>> >>> Thanks Cody for bringing up a valid point. I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun. But now, as we go deeper with Spark and the real-time streaming use case gets more prominent, I think it is time to bring in a messaging model in conjunction with the batch/micro-batch API that Spark is good at. A close integration of akka-streams with Spark's micro-batching APIs looks like a great direction to stay in the game with Apache Flink. Spark 2.0 integrated streaming with batch on the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
> >>> >>> >>>
> >>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.
> >>> >>> >>>
> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>> >>> >>>
> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>> >>> >>>
> >>> >>> >>> Spark is no longer an engine that works only for micro-batch and batch. We (and I am sure many others) are pushing Spark as an engine for stream and query processing; we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!
> >>> >>> >>>
> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
> >>> >>> >>>>
> >>> >>> >>>> Hi everyone,
> >>> >>> >>>>
> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".
> >>> >>> >>>>
> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community: it's all here. But every project has to "fight" on the "framework market" to remain number 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
> >>> >>> >>>>
> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases.
> >>> >>> >>>> In my opinion, many of these discussions were started after Flink's marketing-like posts. Please look at StackOverflow "Flink vs ...." posts: almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMCs) just post the same information about real-time streaming, delta iterations, etc. It looks smart, and very often it is marked as the answer, even if, in my opinion, not all of the truth was told.
> >>> >>> >>>>
> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? Just saying, you're the most visible in the community :) ) could perform a performance test of:
> >>> >>> >>>>
> >>> >>> >>>> - the streaming engine (probably Spark will lose because of the mini-batch model; however, currently the difference should be much lower than in previous versions)
> >>> >>> >>>>
> >>> >>> >>>> - Machine Learning models
> >>> >>> >>>>
> >>> >>> >>>> - batch jobs
> >>> >>> >>>>
> >>> >>> >>>> - graph jobs
> >>> >>> >>>>
> >>> >>> >>>> - SQL queries
> >>> >>> >>>>
> >>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above, people may think "it is outdated, the future is in framework X".
> >>> >>> >>>>
> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.
> >>> >>> >>>>
> >>> >>> >>>> Second: real-time streaming. I wrote some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for an SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too high a latency". Spark Streaming is doing a very good job with micro-batches, but I think it is possible to also add more real-time processing.
> >>> >>> >>>>
> >>> >>> >>>> Other people have said much more, and I agree with the SIP proposal.
> >>> >>> >>>> I'm also happy that the PMCs are not saying that they will not listen to users; they really want to make Spark better for every user.
> >>> >>> >>>>
> >>> >>> >>>> What do you think about these two topics? I'm especially looking at Cody (who started this topic) and the PMCs :)
> >>> >>> >>>>
> >>> >>> >>>> Pozdrawiam / Best regards,
> >>> >>> >>>>
> >>> >>> >>>> Tomasz

--
Ryan Blue
Software Engineer
Netflix