So there are some minor things (the "Where" section heading appears to be dropped; wherever this document is posted, it needs to actually link to a JIRA filter showing current / past SIPs), but it doesn't look like I can comment on the Google doc.
The major substantive issue that I have is that this version is significantly less clear as to the outcome of an SIP. The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement and an explicit deadline, which I think are necessary for clarity.

On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
> It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
>
> On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>> Oops. Let me try to figure that out.
>>
>> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> Thanks for picking up on this.
>>>
>>> Maybe I fail at Google docs, but I can't see any edits on the document you linked.
>>>
>>> Regarding lazy consensus, if the board in general has less of an issue with that, sure. As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
>>>
>>> The other points are hard to comment on without being able to see the text in question.
>>>
>>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>> > I just looked through the entire thread again tonight - there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.
>>> >
>>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
>>> >
>>> > To that end, the two biggest areas for improvement in my opinion are:
>>> >
>>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people that don't follow closely, it is difficult to know what the important initiatives are. Even for people that do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
>>> >
>>> > 2. Solicit user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
>>> >
>>> > I've taken Cody's doc and edited it:
>>> >
>>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> > (I've made all my modifications trackable)
>>> >
>>> > There are a couple of high-level changes I made:
>>> >
>>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting. The reason being that in voting there can easily be a "loser" that gets outvoted.
>>> >
>>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototype implementations in PRs is not something that has not worked so far".
>>> >
>>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
>>> >
>>> > While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that too ...
>>> >
>>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>> >>
>>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
>>> >>
>>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>> >>>
>>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
>>> >>>
>>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
>>> >>>
>>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
>>> >>> >
>>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >>> >
>>> >>> > Or are we going to let this discussion die on the vine?
>>> >>> >
>>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>> >>> >> Maybe my mail was not clear enough.
>>> >>> >>
>>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with the benchmarks was to show two things:
>>> >>> >>
>>> >>> >> - why some people are doing bad PR for Spark
>>> >>> >>
>>> >>> >> - how, in an easy way, we can change that and show that Spark is still on top
>>> >>> >>
>>> >>> >> No more, no less.
>>> >>> >> Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the "Spark vs Hadoop" chart. It is important to show that the framework is not just the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.
>>> >>> >>
>>> >>> >> About real-time streaming, I think it would just be good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more" - the community should also listen to them and try to help them. With SIPs it would be easier; I've just posted this example as a "thing that may be changed with a SIP".
>>> >>> >>
>>> >>> >> I really like the unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with a strong background (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
>>> >>> >>
>>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, that is, from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
>>> >>> >>
>>> >>> >> Pozdrawiam / Best regards,
>>> >>> >>
>>> >>> >> Tomasz
>>> >>> >>
>>> >>> >> ________________________________
>>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>>> >>> >> Sent: 17 October 2016 16:46
>>> >>> >> To: Debasish Das
>>> >>> >> CC: Tomasz Gawęda; dev@spark.apache.org
>>> >>> >> Subject: Re: Spark Improvement Proposals
>>> >>> >>
>>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>> >>> >>
>>> >>> >> My point is evolve or die. Spark's governance and organization are hampering its ability to evolve technologically, and that needs to change.
>>> >>> >>
>>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun...But now, as we go deeper with Spark and the real-time streaming use-case gets more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at....akka-streams' close integration with Spark's micro-batching APIs looks like a great direction to stay in the game with Apache Flink...Spark 2.0 integrated streaming with batch under the assumption that micro-batching is sufficient to run SQL commands on streams, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
>>> >>> >>>
>>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.
>>> >>> >>>
>>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>> >>> >>>
>>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>> >>> >>>
>>> >>> >>> Spark is no longer an engine that works only for micro-batch and batch...We (and I am sure many others) are pushing Spark as an engine for stream and query processing.....we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!
>>> >>> >>>
>>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>> >>> >>>>
>>> >>> >>>> Hi everyone,
>>> >>> >>>>
>>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on these negative posts about Spark and about "haters".
>>> >>> >>>>
>>> >>> >>>> I really like Spark. Ease of use, speed, a very good community - it's all here. But every project has to "fight" on the "framework market" to stay no. 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
>>> >>> >>>>
>>> >>> >>>> You (every Spark developer; so far I didn't have enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases.. In my opinion, many of these discussions were started after Flink marketing-like posts.
>>> >>> >>>> Please look at the StackOverflow "Flink vs ...." posts; almost every such post is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMC members) just post the same information about real-time streaming, about delta iterations, etc. It looks smart and very often is marked as the answer, even if - in my opinion - the whole truth wasn't told.
>>> >>> >>>>
>>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying, you're the most visible in the community :) ) could perform a performance test of:
>>> >>> >>>>
>>> >>> >>>> - the streaming engine - probably Spark will lose because of the micro-batch model; however, currently the difference should be much lower than in previous versions
>>> >>> >>>>
>>> >>> >>>> - Machine Learning models
>>> >>> >>>>
>>> >>> >>>> - batch jobs
>>> >>> >>>>
>>> >>> >>>> - graph jobs
>>> >>> >>>>
>>> >>> >>>> - SQL queries
>>> >>> >>>>
>>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".
>>> >>> >>>>
>>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability.
>>> >>> >>>> Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.
>>> >>> >>>>
>>> >>> >>>> Second: real-time streaming. I've written some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too big a latency". Spark Streaming is doing a very good job with micro-batches; however, I think it is possible to also add more real-time processing.
>>> >>> >>>>
>>> >>> >>>> Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMC members are not saying that they will not listen to users, but that they really want to make Spark better for every user.
>>> >>> >>>>
>>> >>> >>>> What do you think about these two topics?
>>> >>> >>>> I'm especially looking at Cody (who started this topic) and the PMC :)
>>> >>> >>>>
>>> >>> >>>> Pozdrawiam / Best regards,
>>> >>> >>>>
>>> >>> >>>> Tomasz