Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Cody Koeninger
Thanks for picking up on this.

Maybe I fail at google docs, but I can't see any edits on the document
you linked.

Regarding lazy consensus: if the board in general has less of an issue
with that, sure, as long as it is clearly announced, lasts at least
72 hours, and has a clear outcome.

The other points are hard to comment on without being able to see the
text in question.



Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
I just looked through the entire thread again tonight - there are a lot of
great ideas being discussed. Thanks Cody for taking the first crack at the
proposal.

I want to first comment on the context. Spark is one of the most innovative
and important projects in (big) data -- overall, the technical decisions made in
Apache Spark are sound. But of course, a project as large and active as
Spark always has room for improvement, and we as a community should strive
to take it to the next level.

To that end, the two biggest areas for improvements in my opinion are:

1. Visibility: There is so much happening that it is difficult to know
what really is going on. For people who don't follow closely, it is
difficult to know what the important initiatives are. Even for people who
do follow, it is difficult to know what specific things require their
attention, since the number of pull requests and JIRA tickets is high and
it's difficult to extract signal from noise.

2. Solicit user (broadly defined, including developers themselves) input
more proactively: At the end of the day the project provides value because
users use it. Users can't tell us exactly what to build, but it is
important to get their input.


I've taken Cody's doc and edited it:
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
(I've made all my modifications trackable)

There are a couple of high-level changes I made:

1. I've consulted a board member and he recommended lazy consensus as
opposed to voting, the reason being that in voting there can easily be a
"loser" that gets outvoted.

2. I made it lighter weight, and renamed "strategy" to "optional design
sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging
things and linking them elsewhere simply having design docs and prototypes
implementations in PRs is not something that has not worked so far".

3. I made some language tweaks to focus more on visibility. For
example, "The purpose of an SIP is to inform and involve", rather than just
"involve". SIPs should also have at least two emails that go to dev@.


While I was editing this, I realized we really need a suggested template
for the design doc too. I will get to that ...


On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin  wrote:

> Most things looked OK to me too, although I do plan to take a closer look
> after Nov 1st when we cut the release branch for 2.1.
>
>
> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin 
> wrote:
>
>> The proposal looks OK to me. I assume, even though it's not explicitly
>> called, that voting would happen by e-mail? A template for the
>> proposal document (instead of just a bullet nice) would also be nice,
>> but that can be done at any time.
>>
>> BTW, shameless plug: I filed SPARK-18085
>>  which I consider a
>> candidate
>> for a SIP, given the scope of the work. The document attached even
>> somewhat matches the proposed format. So if anyone wants to try out
>> the process...
>>
>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger 
>> wrote:
>> > Now that spark summit europe is over, are any committers interested in
>> > moving forward with this?
>> >
>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-i
>> mprovement-proposals.md
>> >
>> > Or are we going to let this discussion die on the vine?
>> >
>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>> >  wrote:
>> >> Maybe my mail was not clear enough.
>> >>
>> >>
>> >> I didn't want to write "lets focus on Flink" or any other framework.
>> The
>> >> idea with benchmarks was to show two things:
>> >>
>> >> - why some people are doing bad PR for Spark
>> >>
>> >> - how - in easy way - we can change it and show that Spark is still on
>> the
>> >> top
>> >>
>> >>
>> >> No more, no less. Benchmarks will be helpful, but I don't think
>> they're the
>> >> most important thing in Spark :) On the Spark main page there is still
>> chart
>> >> "Spark vs Hadoop". It is important to show that framework is not the
>> same
>> >> Spark with other API, but much faster and optimized, comparable or even
>> >> faster than other frameworks.
>> >>
>> >>
>> >> About real-time streaming, I think it would be just good to see it in
>> Spark.
>> >> I very like current Spark model, but many voices that says "we need
>> more" -
>> >> community should listen also them and try to help them. With SIPs it
>> would
>> >> be easier, I've just posted this example as "thing that may be changed
>> with
>> >> SIP".
>> >>
>> >>
>> >> I very like unification via Datasets, but there is a lot of algorithms
>> >> inside - let's make easy API, but with strong background (articles,
>> >> benchmarks, descriptions, etc) that shows that Spark is still modern
>> >> framework.
>> >>
>> >>
>> >> Maybe now my intention will be clearer :) As I said organizational
>> ideas
>> >> were 

Re: Odp.: Spark Improvement Proposals

2016-11-01 Thread Reynold Xin
Most things looked OK to me too, although I do plan to take a closer look
after Nov 1st when we cut the release branch for 2.1.


Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Marcelo Vanzin
The proposal looks OK to me. I assume, even though it's not explicitly
called out, that voting would happen by e-mail? A template for the
proposal document (instead of just a bullet list) would also be nice,
but that can be done at any time.

BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
for a SIP, given the scope of the work. The document attached even
somewhat matches the proposed format. So if anyone wants to try out
the process...


Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Ryan Blue
I agree, we should push forward on this. I think there is enough consensus
to call a vote, unless someone else thinks that there is more to discuss?

rb


Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Cody Koeninger
Now that Spark Summit Europe is over, are any committers interested in
moving forward with this?

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?


Odp.: Spark Improvement Proposals

2016-10-17 Thread Tomasz Gawęda
Maybe my mail was not clear enough.


I didn't want to write "let's focus on Flink" or any other framework. The idea
with benchmarks was to show two things:

- why some people are doing bad PR for Spark

- how, in an easy way, we can change that and show that Spark is still on top


No more, no less. Benchmarks will be helpful, but I don't think they're the
most important thing in Spark :) On the Spark main page there is still the
"Spark vs Hadoop" chart. It is important to show that the framework is not the
same Spark with another API, but much faster and more optimized, comparable to
or even faster than other frameworks.


About real-time streaming, I think it would just be good to see it in Spark. I
really like the current Spark model, but there are many voices saying "we need
more" - the community should also listen to them and try to help them. With
SIPs it would be easier; I've just posted this example as a "thing that may be
changed with a SIP".


I really like the unification via Datasets, but there are a lot of algorithms
inside - let's make an easy API, but with strong background material (articles,
benchmarks, descriptions, etc.) that shows that Spark is still a modern
framework.
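The "unification via Datasets" point is about one high-level API serving both batch and streaming execution. A minimal, framework-free Python sketch of that design idea (this is not Spark code; all names are illustrative):

```python
# Sketch of the "one API for batch and streaming" idea behind Datasets:
# the *same* user-defined transformation pipeline runs unchanged over a
# static collection and over an incremental stream of micro-batches.

def pipeline(records):
    """User-facing logic, written once: filter short words, then project."""
    return [r["word"].upper() for r in records if len(r["word"]) > 3]

# Batch execution: the whole dataset is available at once.
batch = [{"word": "spark"}, {"word": "is"}, {"word": "fast"}]
batch_result = pipeline(batch)

# Streaming execution: the identical pipeline is applied to each
# micro-batch as it arrives, and results are appended incrementally.
micro_batches = [[{"word": "flink"}], [{"word": "vs"}, {"word": "spark"}]]
stream_result = []
for mb in micro_batches:
    stream_result.extend(pipeline(mb))

print(batch_result)   # ['SPARK', 'FAST']
print(stream_result)  # ['FLINK', 'SPARK']
```

The design choice being argued for: keep the user's code identical in both modes and let the engine decide how to execute it, which is exactly where the supporting articles and benchmarks would need to back up the claim.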


Maybe now my intention will be clearer :) As I said, organizational ideas were
already mentioned and I agree with them; my mail was just to show some aspects
from my side, so from the side of a developer and a person who is trying to
help others with Spark (via StackOverflow or other ways).


Pozdrawiam / Best regards,

Tomasz



From: Cody Koeninger 
Sent: 17 October 2016 16:46
To: Debasish Das
Cc: Tomasz Gawęda; dev@spark.apache.org
Subject: Re: Spark Improvement Proposals

I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die.  Spark's governance and organization is
hampering its ability to evolve technologically, and it needs to
change.

On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das  wrote:
> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as
> soon as I looked into it, since compared to writing Java map-reduce and
> Cascading code, Spark made writing distributed code fun... But now, as we went
> deeper with Spark and the real-time streaming use-case gets more prominent, I
> think it is time to bring a messaging model in conjunction with the
> batch/micro-batch API that Spark is good at... akka-streams close
> integration with Spark micro-batching APIs looks like a great direction to
> stay in the game with Apache Flink... Spark 2.0 integrated streaming with
> batch with the assumption that micro-batching is sufficient to run SQL
> commands on a stream, but do we really have time to do SQL processing on
> streaming data within 1-2 seconds?
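The 1-2 second concern above comes from micro-batch scheduling: a record must wait for the next batch boundary before any job can see it, so end-to-end latency is bounded below by the batch interval plus processing time. A toy, Spark-free Python sketch of that lower bound (the interval and processing-time numbers are assumptions chosen for illustration):

```python
import math

# Assumed parameters of a hypothetical micro-batch engine:
BATCH_INTERVAL = 1.0   # seconds between micro-batch boundaries
PROCESSING_TIME = 0.3  # time to run the job on one micro-batch

def completion_time(arrival):
    """Time at which a record arriving at `arrival` is fully processed:
    it waits for the next batch boundary, then for the job to finish."""
    boundary = math.ceil(arrival / BATCH_INTERVAL) * BATCH_INTERVAL
    return boundary + PROCESSING_TIME

# A record arriving just after a boundary waits almost a full interval:
worst = completion_time(0.01) - 0.01   # ~1.29 s end-to-end
# A record arriving just before the next boundary is luckier:
best = completion_time(0.99) - 0.99    # ~0.31 s end-to-end

print(round(worst, 2), round(best, 2))
```

So with a 1-second trigger the worst-case latency already exceeds 1 second before any query work is counted, which is the crux of the "SQL on streams within 1-2 seconds" question.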
>
> After reading the email chain, I started to look into the Flink documentation
> and if you compare it with the Spark documentation, I think we have major work
> to do detailing out Spark internals, so that more people from the community
> start to take an active role in improving the issues and Spark stays strong
> compared to Flink.
>
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>
> Spark is no longer an engine that works only for micro-batch and batch... We
> (and I am sure many others) are pushing Spark as an engine for stream and query
> processing... we need to make it a state-of-the-art engine for high speed
> streaming data and user queries as well!
>
> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda 
> wrote:
>>
>> Hi everyone,
>>
>> I'm quite late with my answer, but I think my suggestions may help a
>> little bit. :) Many technical and organizational topics were mentioned,
>> but I want to focus on these negative posts about Spark and about "haters"
>>
>> I really like Spark. Ease of use, speed, very good community - it's
>> all here. But every project has to "fight" on the "framework market"
>> to still be no. 1. I'm following many Spark and Big Data communities;
>> maybe my mail will inspire someone :)
>>
>> You (every Spark developer; so far I haven't had enough time to join in
>> contributing to Spark) have done an excellent job. So why are some people
>> saying that Flink (or another framework) is better, like it was posted on
>> this mailing list? No, not because that framework is better in all
>> cases. In my opinion, many of these discussions were started after
>> Flink marketing-like posts. Please look at StackOverflow "Flink vs "
>> posts: almost every post is "won" by Flink. The answers sometimes
>> say nothing about other frameworks; Flink's users (often PMCs) are
>> just posting the same information about real-time streaming, about delta
>> iterations, etc. It looks smart and very often it is marked as the answer,
>> even if - in my opinion - the whole truth wasn't told.
>>
>>
>> My suggestion: I don't have enough money and knowledge to perform huge
>> performance