Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?

On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
> Maybe my mail was not clear enough.
>
> I didn't want to write "let's focus on Flink" or any other framework. The
> idea with benchmarks was to show two things:
>
> - why some people are doing bad PR for Spark
>
> - how, in an easy way, we can change it and show that Spark is still on
>   top
>
> No more, no less. Benchmarks will be helpful, but I don't think they're
> the most important thing in Spark :) On the Spark main page there is still
> the chart "Spark vs Hadoop". It is important to show that the framework is
> not just the same Spark with another API, but much faster and optimized,
> comparable to or even faster than other frameworks.
>
> About real-time streaming, I think it would just be good to see it in
> Spark. I really like the current Spark model, but many voices say "we need
> more"; the community should also listen to them and try to help them. With
> SIPs it would be easier; I just posted this example as a "thing that may
> be changed with a SIP".
>
> I really like unification via Datasets, but there are a lot of algorithms
> inside. Let's make an easy API, but with a strong background (articles,
> benchmarks, descriptions, etc.) that shows that Spark is still a modern
> framework.
>
> Maybe now my intention will be clearer :) As I said, organizational ideas
> were already mentioned and I agree with them. My mail was just to show
> some aspects from my side, that is, from the side of a developer and a
> person who is trying to help others with Spark (via StackOverflow or other
> ways).
>
> Best regards,
>
> Tomasz
>
> ________________________________
> From: Cody Koeninger <c...@koeninger.org>
> Sent: 17 October 2016 16:46
> To: Debasish Das
> Cc: Tomasz Gawęda; dev@spark.apache.org
> Subject: Re: Spark Improvement Proposals
>
> I think narrowly focusing on Flink or benchmarks is missing my point.
>
> My point is evolve or die. Spark's governance and organization is
> hampering its ability to evolve technologically, and it needs to
> change.
>
> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as
>> soon as I looked into it, since compared to writing Java map-reduce and
>> Cascading code, Spark made writing distributed code fun... But now, as we
>> go deeper with Spark and the real-time streaming use case gets more
>> prominent, I think it is time to bring in a messaging model in
>> conjunction with the batch/micro-batch API that Spark is good at... Close
>> integration of akka-streams with the Spark micro-batching APIs looks like
>> a great direction to stay in the game with Apache Flink... Spark 2.0
>> integrated streaming with batch on the assumption that micro-batching is
>> sufficient to run SQL commands on a stream, but do we really have time to
>> do SQL processing on streaming data within 1-2 seconds?
>>
>> After reading the email chain, I started to look into the Flink
>> documentation, and if you compare it with the Spark documentation, I
>> think we have major work to do detailing Spark internals, so that more
>> people from the community start to take an active role in fixing the
>> issues and Spark stays strong compared to Flink.
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>
>> Spark is no longer an engine that works for micro-batch and batch... We
>> (and I am sure many others) are pushing Spark as an engine for stream and
>> query processing... We need to make it a state-of-the-art engine for
>> high-speed streaming data and user queries as well!
>>
>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com>
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> I'm quite late with my answer, but I think my suggestions may help a
>>> little bit. :) Many technical and organizational topics were mentioned,
>>> but I want to focus on these negative posts about Spark and about
>>> "haters".
>>>
>>> I really like Spark. Ease of use, speed, a very good community: it's
>>> all here. But every project has to "fight" on the "framework market"
>>> to stay number 1. I'm following many Spark and Big Data communities;
>>> maybe my mail will inspire someone :)
>>>
>>> You (every Spark developer; so far I haven't had enough time to start
>>> contributing to Spark) have done an excellent job. So why are some
>>> people saying that Flink (or another framework) is better, as was
>>> posted on this mailing list? No, not because that framework is better
>>> in all cases. In my opinion, many of these discussions were started
>>> after Flink marketing-like posts. Please look at StackOverflow
>>> "Flink vs ...." posts: almost every post is "won" by Flink. Answers
>>> sometimes say nothing about other frameworks; Flink's users (often PMC
>>> members) are just posting the same information about real-time
>>> streaming, about delta iterations, etc. It looks smart and very often
>>> it is marked as an answer, even if, in my opinion, the whole truth
>>> wasn't told.
>>>
>>>
>>> My suggestion: I don't have enough money and knowledge to perform a huge
>>> performance test.
>>> Maybe some company that supports Spark (Databricks, Cloudera? - just
>>> saying, you're the most visible in the community :) ) could perform a
>>> performance test of:
>>>
>>> - streaming engine - Spark will probably lose because of the mini-batch
>>> model; however, the difference should now be much lower than in
>>> previous versions
>>>
>>> - Machine Learning models
>>>
>>> - batch jobs
>>>
>>> - Graph jobs
>>>
>>> - SQL queries
>>>
>>> People will see that Spark is evolving and is also a modern framework,
>>> because after reading the posts mentioned above people may think "it is
>>> outdated, the future is in framework X".
>>>
>>> Matei Zaharia posted an excellent blog post about how Spark Structured
>>> Streaming beats every other framework in terms of ease of use and
>>> reliability. Performance tests, done in various environments (for
>>> example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node
>>> cluster), could also be very good marketing material to say "hey, you're
>>> saying that you're better, but Spark is still faster and is still
>>> getting even faster!". This would be based on facts (just numbers),
>>> not opinions. It would be good for companies, for marketing purposes,
>>> and for every Spark developer.
>>>
>>>
>>> Second: real-time streaming. I wrote some time ago about real-time
>>> streaming support in Spark Structured Streaming. Some work should be
>>> done to make SSS more low-latency, but I think it's possible. Maybe
>>> Spark could look at Gearpump, which is also built on top of Akka? I
>>> don't know yet; it is a good topic for a SIP. However, I think that
>>> Spark should have real-time streaming support. Currently I see many
>>> posts/comments saying that "Spark has too much latency". Spark Streaming
>>> does a very good job with micro-batches; however, I think it is possible
>>> to also add more real-time processing.
>>>
>>> Other people said much more and I agree with the SIP proposal.
>>> I'm also happy that PMC members are not saying that they will not
>>> listen to users, but that they really want to make Spark better for
>>> every user.
>>>
>>>
>>> What do you think about these two topics? I'm especially looking at Cody
>>> (who started this topic) and PMC members :)
>>>
>>> Best regards,
>>>
>>> Tomasz
>>>
>>>
>>> On 2016-10-07 at 04:51, Cody Koeninger wrote:
>>> > I love Spark. 3 or 4 years ago it was the first distributed computing
>>> > environment that felt usable, and the community was welcoming.
>>> >
>>> > But I just got back from the Reactive Summit, and this is what I
>>> > observed:
>>> >
>>> > - Industry leaders on stage making fun of Spark's streaming model
>>> > - Open source project leaders saying they looked at Spark's governance
>>> > as a model to avoid
>>> > - Users saying they chose Flink because it was technically superior
>>> > and they couldn't get any answers on the Spark mailing lists
>>> >
>>> > Whether you agree with the substance of any of this, when this stuff
>>> > gets repeated enough people will believe it.
>>> >
>>> > Right now Spark is suffering from its own success, and I think
>>> > something needs to change.
>>> >
>>> > - We need a clear process for planning significant changes to the
>>> > codebase.
>>> > I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>>> > but you need a documented process with a clear outcome (e.g. a vote).
>>> > Passing around Google Docs after an implementation has largely been
>>> > decided on doesn't cut it.
>>> >
>>> > - All technical communication needs to be public.
>>> > Things getting decided in private chat, or when 1/3 of the committers
>>> > work for the same company and can just talk to each other...
>>> > Yes, it's convenient, but it's ultimately detrimental to the health of
>>> > the project.
>>> > The way structured streaming has played out has shown that there are
>>> > significant technical blind spots (myself included).
>>> > One way to address that is to get the people who have domain knowledge
>>> > involved, and listen to them.
>>> >
>>> > - We need more committers, and more committer diversity.
>>> > Per committer there are, what, more than 20 contributors and 10 new
>>> > JIRA tickets a month? It's too much.
>>> > There are people (I am _not_ referring to myself) who have been around
>>> > for years, contributed thousands of lines of code, helped educate the
>>> > public around Spark... and yet are never going to be voted in.
>>> >
>>> > - We need a clear process for managing volunteer work.
>>> > Too many tickets sit around unowned, unclosed, uncertain.
>>> > If someone proposed something and it isn't up to snuff, tell them and
>>> > close it. It may be blunt, but it's clearer than a "silent no".
>>> > If someone wants to work on something, let them own the ticket and set
>>> > a deadline. If they don't meet it, close it or reassign it.
>>> >
>>> > This is not me putting on an Apache Bureaucracy hat. This is me
>>> > saying, as a fellow hacker and loyal dissenter, something is wrong
>>> > with the culture and process.
>>> >
>>> > Please, let's change it.
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>
>>
>