Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> The proposal looks OK to me. I assume, even though it's not explicitly
> called out, that voting would happen by e-mail? A template for the
> proposal document (instead of just a bullet list) would also be nice,
> but that can be done at any time.
>
> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate
> for a SIP, given the scope of the work. The attached document even
> roughly matches the proposed format. So if anyone wants to try out the
> process...
>
> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
> > Now that Spark Summit Europe is over, are any committers interested in
> > moving forward with this?
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >
> > Or are we going to let this discussion die on the vine?
> >
> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
> >> Maybe my mail was not clear enough.
> >>
> >> I didn't want to write "let's focus on Flink" or any other framework.
> >> The idea with the benchmarks was to show two things:
> >>
> >> - why some people are generating bad PR for Spark
> >>
> >> - how, in an easy way, we can change that and show that Spark is
> >> still on top
> >>
> >> No more, no less. Benchmarks will be helpful, but I don't think
> >> they're the most important thing in Spark :) On the Spark main page
> >> there is still the "Spark vs Hadoop" chart. It is important to show
> >> that the framework is not the same Spark with another API, but much
> >> faster and more optimized, comparable to or even faster than other
> >> frameworks.
> >>
> >> About real-time streaming: I think it would simply be good to see it
> >> in Spark. I really like the current Spark model, but there are many
> >> voices saying "we need more" - the community should listen to them
> >> too and try to help them.
> >> With SIPs it would be easier; I posted this example just as "a thing
> >> that could be changed with a SIP".
> >>
> >> I really like the unification via Datasets, but there are a lot of
> >> algorithms inside - let's make an easy API, but with strong backing
> >> material (articles, benchmarks, descriptions, etc.) that shows Spark
> >> is still a modern framework.
> >>
> >> Maybe now my intention is clearer :) As I said, the organizational
> >> ideas were already mentioned and I agree with them; my mail was just
> >> to show some aspects from my side, i.e. from the side of a developer
> >> and a person who is trying to help others with Spark (via
> >> StackOverflow or other channels).
> >>
> >> Pozdrawiam / Best regards,
> >>
> >> Tomasz
> >>
> >> ________________________________
> >> From: Cody Koeninger <c...@koeninger.org>
> >> Sent: 17 October 2016 16:46
> >> To: Debasish Das
> >> CC: Tomasz Gawęda; dev@spark.apache.org
> >> Subject: Re: Spark Improvement Proposals
> >>
> >> I think narrowly focusing on Flink or benchmarks is missing my point.
> >>
> >> My point is: evolve or die. Spark's governance and organization are
> >> hampering its ability to evolve technologically, and that needs to
> >> change.
> >>
> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> >>> Thanks Cody for bringing up a valid point... I picked up Spark in
> >>> 2014 as soon as I looked into it, since compared to writing Java
> >>> map-reduce and Cascading code, Spark made writing distributed code
> >>> fun... But now that we have gone deeper with Spark and the real-time
> >>> streaming use case is getting more prominent, I think it is time to
> >>> bring in a messaging model alongside the batch/micro-batch API that
> >>> Spark is good at... Close integration of akka-streams with Spark's
> >>> micro-batching APIs looks like a great direction to stay in the game
> >>> with Apache Flink... Spark 2.0 integrated streaming with batch on
> >>> the assumption that micro-batching is sufficient to run SQL commands
> >>> on a stream, but do we really have time to do SQL processing on
> >>> streaming data within 1-2 seconds?
> >>>
> >>> After reading this email chain, I started to look into the Flink
> >>> documentation, and if you compare it with the Spark documentation, I
> >>> think we have major work to do detailing Spark's internals, so that
> >>> more people from the community start taking an active role in
> >>> improving things and Spark stays strong compared to Flink.
> >>>
> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>>
> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>>
> >>> Spark is no longer an engine that works only for micro-batch and
> >>> batch... We (and I am sure many others) are pushing Spark as an
> >>> engine for stream and query processing... We need to make it a
> >>> state-of-the-art engine for high-speed streaming data and user
> >>> queries as well!
> >>>
> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
> >>>>
> >>>> Hi everyone,
> >>>>
> >>>> I'm quite late with my answer, but I think my suggestions may help
> >>>> a little bit.
> >>>> :) Many technical and organizational topics were mentioned, but I
> >>>> want to focus on the negative posts about Spark and about "haters".
> >>>>
> >>>> I really like Spark: ease of use, speed, a very good community -
> >>>> it's all here. But every project has to "fight" on the "framework
> >>>> market" to stay no. 1. I'm following many Spark and Big Data
> >>>> communities; maybe my mail will inspire someone :)
> >>>>
> >>>> You (every Spark developer; so far I haven't had enough time to
> >>>> start contributing to Spark) have done an excellent job. So why are
> >>>> some people saying that Flink (or another framework) is better, as
> >>>> was posted on this mailing list? Not because that framework is
> >>>> better in all cases. In my opinion, many of these discussions were
> >>>> started after Flink's marketing-like posts. Please look at the
> >>>> StackOverflow "Flink vs ..." posts: almost every one is "won" by
> >>>> Flink. The answers sometimes say nothing about other frameworks;
> >>>> Flink users (often PMC members) just post the same information
> >>>> about real-time streaming, delta iterations, etc. It looks smart
> >>>> and is very often marked as the answer, even if - in my opinion -
> >>>> the whole truth wasn't told.
> >>>>
> >>>> My suggestion: I don't have enough money or knowledge to perform a
> >>>> huge performance test. Maybe some company that supports Spark
> >>>> (Databricks, Cloudera? - just saying, you're the most visible in
> >>>> the community :)) could perform performance tests of:
> >>>>
> >>>> - the streaming engine - Spark will probably lose because of the
> >>>> mini-batch model, though the difference should now be much smaller
> >>>> than in previous versions
> >>>>
> >>>> - machine learning models
> >>>>
> >>>> - batch jobs
> >>>>
> >>>> - graph jobs
> >>>>
> >>>> - SQL queries
> >>>>
> >>>> People will see that Spark is evolving and is also a modern
> >>>> framework, because after reading the posts mentioned above people
> >>>> may think "it is outdated, the future is in framework X".
> >>>>
> >>>> Matei Zaharia posted an excellent blog post about how Spark
> >>>> Structured Streaming beats every other framework in terms of ease
> >>>> of use and reliability. Performance tests done in various
> >>>> environments (for example: a laptop, a small 2-node cluster, a
> >>>> 10-node cluster, a 20-node cluster) could also be very good
> >>>> marketing material, to say "hey, you claim you're better, but Spark
> >>>> is still faster and is still getting even faster!". This would be
> >>>> based on facts (just numbers), not opinions. It would be good for
> >>>> companies, for marketing purposes, and for every Spark developer.
> >>>>
> >>>> Second: real-time streaming. I wrote some time ago about real-time
> >>>> streaming support in Spark Structured Streaming. Some work needs to
> >>>> be done to make SSS lower-latency, but I think it's possible. Maybe
> >>>> Spark could look at Gearpump, which is also built on top of Akka? I
> >>>> don't know yet; it is a good topic for a SIP. However, I think that
> >>>> Spark should have real-time streaming support. Currently I see many
> >>>> posts/comments saying "Spark has too much latency". Spark Streaming
> >>>> is doing a very good job with micro-batches, but I think it is
> >>>> possible to also add more real-time processing.
> >>>>
> >>>> Other people have said much more, and I agree with the SIP proposal.
> >>>> I'm also happy that the PMC members are not saying they won't
> >>>> listen to users; they really want to make Spark better for every
> >>>> user.
> >>>>
> >>>> What do you think about these two topics? I'm looking especially at
> >>>> Cody (who started this topic) and the PMC :)
> >>>>
> >>>> Pozdrawiam / Best regards,
> >>>>
> >>>> Tomasz
> >>>>
> >>>> On 2016-10-07 04:51, Cody Koeninger wrote:
> >>>> > I love Spark. 3 or 4 years ago it was the first distributed
> >>>> > computing environment that felt usable, and the community was
> >>>> > welcoming.
> >>>> >
> >>>> > But I just got back from the Reactive Summit, and this is what I
> >>>> > observed:
> >>>> >
> >>>> > - Industry leaders on stage making fun of Spark's streaming model
> >>>> > - Open source project leaders saying they looked at Spark's
> >>>> > governance as a model to avoid
> >>>> > - Users saying they chose Flink because it was technically
> >>>> > superior and they couldn't get any answers on the Spark mailing
> >>>> > lists
> >>>> >
> >>>> > Whether or not you agree with the substance of any of this, when
> >>>> > this stuff gets repeated enough, people will believe it.
> >>>> >
> >>>> > Right now Spark is suffering from its own success, and I think
> >>>> > something needs to change.
> >>>> >
> >>>> > - We need a clear process for planning significant changes to the
> >>>> > codebase.
> >>>> > I'm not saying you need to adopt Kafka Improvement Proposals
> >>>> > exactly, but you need a documented process with a clear outcome
> >>>> > (e.g. a vote). Passing around Google docs after an implementation
> >>>> > has largely been decided on doesn't cut it.
> >>>> >
> >>>> > - All technical communication needs to be public.
> >>>> > Things getting decided in private chat, or when 1/3 of the
> >>>> > committers work for the same company and can just talk to each
> >>>> > other... Yes, it's convenient, but it's ultimately detrimental to
> >>>> > the health of the project.
> >>>> > The way structured streaming has played out has shown that there
> >>>> > are significant technical blind spots (myself included).
> >>>> > One way to address that is to get the people who have domain
> >>>> > knowledge involved, and listen to them.
> >>>> >
> >>>> > - We need more committers, and more committer diversity.
> >>>> > Per committer there are, what, more than 20 contributors and 10
> >>>> > new Jira tickets a month? It's too much.
> >>>> > There are people (I am _not_ referring to myself) who have been
> >>>> > around for years, contributed thousands of lines of code, helped
> >>>> > educate the public about Spark... and yet are never going to be
> >>>> > voted in.
> >>>> >
> >>>> > - We need a clear process for managing volunteer work.
> >>>> > Too many tickets sit around unowned, unclosed, uncertain.
> >>>> > If someone proposes something and it isn't up to snuff, tell them
> >>>> > and close it. It may be blunt, but it's clearer than a "silent
> >>>> > no".
> >>>> > If someone wants to work on something, let them own the ticket
> >>>> > and set a deadline. If they don't meet it, close it or reassign
> >>>> > it.
> >>>> >
> >>>> > This is not me putting on an Apache Bureaucracy hat. This is me
> >>>> > saying, as a fellow hacker and loyal dissenter, that something is
> >>>> > wrong with the culture and process.
> >>>> >
> >>>> > Please, let's change it.
> >>>> >
> >>>> > ---------------------------------------------------------------------
> >>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
> Marcelo