[VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-07 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.2-rc3
(584354eaac02531c9584188b143367ba694b0c34)

This release candidate resolves 84 issues:
https://s.apache.org/spark-2.0.2-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1214/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions from 2.0.1.
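
For example, a minimal sbt sketch for compiling an existing workload against
the staging artifacts listed above (the resolver name is arbitrary, and the
dependencies are marked "provided" on the assumption that the job is run via
spark-submit):

  // build.sbt -- point an existing workload at the RC3 staging repository
  resolvers += "Spark 2.0.2 RC3 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1214/"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
    "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
  )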

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present
in 2.0.1, missing features, or bugs related to new features will not
necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from
now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
(i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.


[ANNOUNCE] Announcing Apache Spark 1.6.3

2016-11-07 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.3! This maintenance
release includes fixes across several areas of Spark, and we encourage users
on the 1.6.x line to upgrade to 1.6.3.

Head to the project's download page to download the new version:
http://spark.apache.org/downloads.html


Re: Handling questions in the mailing lists

2016-11-07 Thread Denny Lee
To help jump-start the verbiage for the Spark community page and the welcome
email, here's a working document for us to work with:
https://docs.google.com/document/d/1N0pKatcM15cqBPqFWCqIy6jdgNzIoacZlYDCjufBh2s/edit#

Hope this will help us collaborate on this stuff a little faster.

On Mon, Nov 7, 2016 at 2:25 PM Maciej Szymkiewicz wrote:

> Just a couple of random thoughts regarding Stack Overflow...
>
> - If we are thinking about shifting focus towards SO, all attempts at
> micromanaging should be discarded right at the beginning. Especially
> things like meta tags, which are discouraged and "burninated"
> (https://meta.stackoverflow.com/tags/burninate-request/info), or thread
> bumping. Depending on the context, these would be unmanageable, against
> community guidelines, or simply obsolete.
> - Lack of expertise is unlikely to be an issue. Even now there are a
> number of advanced Spark users on SO. Of course, the more the merrier.
>
> Things that can be easily improved:
>
> - Identifying, improving and promoting canonical questions and answers.
> That means closing duplicates, suggesting edits to improve existing
> answers, and providing alternative solutions. This can also be used to
> identify gaps in the documentation.
> - Providing a set of clear posting guidelines to reduce the effort
> required to identify the problem (think of
> http://stackoverflow.com/q/5963269, a.k.a. "How to make a great R
> reproducible example?").
> - Helping users decide if a question is a good fit for SO (see below).
> API questions are a great fit; debugging problems like "my cluster is
> slow" are not.
> - Actively cleaning up (closing, deleting) off-topic and low-quality
> questions. The less junk there is to sieve through, the better the
> chance of good questions being answered.
> - Repurposing and actively moderating SO docs
> (https://stackoverflow.com/documentation/apache-spark/topics). Right now
> most of the stuff that goes there is useless, duplicated, plagiarized,
> or borderline spam.
> - Encouraging the community to monitor featured
> (https://stackoverflow.com/questions/tagged/apache-spark?sort=featured)
> and active & upvoted & unanswered
> (https://stackoverflow.com/unanswered/tagged/apache-spark) questions.
> - Implementing some procedure to identify questions which are likely to
> be bugs or material for feature requests. Personally, I am quite often
> tempted to simply send a link to the dev list, but I don't think that is
> really acceptable.
> - Animating the Spark-related chat room. I tried this a couple of times,
> but to no avail. Without a certain critical mass of users it just won't
> work.
>
>
>
> On 11/07/2016 07:32 AM, Reynold Xin wrote:
>
> This is an excellent point. If we do go ahead and feature SO as a way for
> users to ask questions more prominently, as someone who knows SO very well,
> would you be willing to help write a short guideline (ideally the shorter
> the better, which makes it hard) to direct what goes to user@ and what
> goes to SO?
>
>
> Sure, I'll be happy to help if I can.
>
>
>
>
> On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz wrote:
>
> Damn, I always thought that mailing list is only for nice and welcoming
> people and there is nothing to do for me here >:)
>
> To be serious though, there are many questions on the users list which
> would fit just fine on SO, but that is not true in general. There are dozens
> of questions which are too broad, opinion based, ask for external resources,
> and so on. If you want to direct users to SO, you have to help them decide
> if it is the right channel. Otherwise it will just create a really bad
> experience both for those seeking help and for active answerers. The former
> will be downvoted and bashed; the latter will have to deal with all the
> junk, and the number of active Spark users with moderation privileges is
> really low (with only Massg and me being able to directly close duplicates).
>
> Believe me, I've seen this before.
> On 11/07/2016 05:08 AM, Reynold Xin wrote:
>
> You have substantially underestimated how opinionated people can be on
> mailing lists too :)
>
> On Sunday, November 6, 2016, Maciej Szymkiewicz wrote:
>
> You have to remember that the Stack Overflow crowd (like me) is highly
> opinionated, so many questions which could be just fine on the mailing
> list will be quickly downvoted and / or closed as off-topic. Just
> saying...
>
> --
> Best,
> Maciej
>
>
> On 11/07/2016 04:03 AM, Reynold Xin wrote:
>
> OK I've checked on the ASF member list (which is private so there is no
> public archive).
>
> It is not against any ASF rule to recommend StackOverflow as a place for
> users to ask questions. I don't think we can or should delete the existing
> user@spark list either, but we can certainly make SO more visible than it
> is.
>
>
>
> On Wed, Nov 2, 2016 

Re: REST api for monitoring Spark Streaming

2016-11-07 Thread Chan Chor Pang

Thank you

This should take me at least a few days; I will let you know as soon
as the PR is ready.



On 11/8/16 11:44 AM, Tathagata Das wrote:
This may be a good addition. I suggest you read our guidelines on 
contributing code to Spark.


https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-PreparingtoContributeCodeChanges

It's a long document, but it should have everything you need to figure out
how to contribute your changes. I hope to see your changes in a GitHub
PR soon!


TD

On Mon, Nov 7, 2016 at 5:30 PM, Chan Chor Pang wrote:


hi everyone

it seems there are not many people interested in creating an API for
Streaming. Nevertheless, I still really want the API for monitoring,
so I tried to see if I can implement it on my own.

After some testing, I believe I can achieve the goal by:
1. implementing a package (org.apache.spark.streaming.status.api.v1)
that serves the same purpose as org.apache.spark.status.api.v1
2. registering the API path through StreamingTab
3. retrieving the streaming information through
StreamingJobProgressListener

My main concern now is whether my implementation will be able to be
merged into the mainline.

I'm new to open-source projects, so could anyone please shed some light
on how I should/could proceed to get my implementation merged?


here is my test code, based on v1.6.0
###
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
new file mode 100644
index 0000000..690e2d8
--- /dev/null
+++ b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
@@ -0,0 +1,68 @@
+package org.apache.spark.streaming.status.api.v1
+
+import java.io.OutputStream
+import java.lang.annotation.Annotation
+import java.lang.reflect.Type
+import java.text.SimpleDateFormat
+import java.util.{Calendar, SimpleTimeZone}
+import javax.ws.rs.Produces
+import javax.ws.rs.core.{MediaType, MultivaluedMap}
+import javax.ws.rs.ext.{MessageBodyWriter, Provider}
+
+import com.fasterxml.jackson.annotation.JsonInclude
+import com.fasterxml.jackson.databind.{ObjectMapper, SerializationFeature}
+
+@Provider
+@Produces(Array(MediaType.APPLICATION_JSON))
+private[v1] class JacksonMessageWriter extends MessageBodyWriter[Object]{
+
+  val mapper = new ObjectMapper() {
+    override def writeValueAsString(t: Any): String = {
+      super.writeValueAsString(t)
+    }
+  }
+  mapper.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule)
+  mapper.enable(SerializationFeature.INDENT_OUTPUT)
+  mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL)
+  mapper.setDateFormat(JacksonMessageWriter.makeISODateFormat)
+
+  override def isWriteable(
+      aClass: Class[_],
+      `type`: Type,
+      annotations: Array[Annotation],
+      mediaType: MediaType): Boolean = {
+    true
+  }
+
+  override def writeTo(
+      t: Object,
+      aClass: Class[_],
+      `type`: Type,
+      annotations: Array[Annotation],
+      mediaType: MediaType,
+      multivaluedMap: MultivaluedMap[String, AnyRef],
+      outputStream: OutputStream): Unit = {
+    t match {
+      //case ErrorWrapper(err) => outputStream.write(err.getBytes("utf-8"))
+      case _ => mapper.writeValue(outputStream, t)
+    }
+  }
+
+  override def getSize(
+      t: Object,
+      aClass: Class[_],
+      `type`: Type,
+      annotations: Array[Annotation],
+      mediaType: MediaType): Long = {
+    -1L
+  }
+}
+
+private[spark] object JacksonMessageWriter {
+  def makeISODateFormat: SimpleDateFormat = {
+    val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'GMT'")
+    val cal = Calendar.getInstance(new SimpleTimeZone(0, "GMT"))
+    iso8601.setCalendar(cal)
+    iso8601
+  }
+}
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala
new file mode 100644
index 0000000..f4e43dd
--- /dev/null
+++ b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala
@@ -0,0 +1,74 @@
+package org.apache.spark.streaming.status.api.v1
+
+import org.apache.spark.status.api.v1.UIRoot
+import org.eclipse.jetty.server.handler.ContextHandler
+import 
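
The diff is cut off above. As a quick sanity check once the handler is
registered, something like the following sketch could fetch the payload; the
endpoint path follows from the "/streamingapi" context path and the @Path
annotations in the diff, while localhost and the default UI port 4040 are
assumptions about the deployment.

  // fetch the streaming info JSON from the driver UI (hypothetical setup)
  val json = scala.io.Source
    .fromURL("http://localhost:4040/streamingapi/v1/streaminginfo")
    .mkString
  println(json)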

Re: REST api for monitoring Spark Streaming

2016-11-07 Thread Tathagata Das
This may be a good addition. I suggest you read our guidelines on
contributing code to Spark.

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-PreparingtoContributeCodeChanges

It's a long document, but it should have everything you need to figure out
how to contribute your changes. I hope to see your changes in a GitHub PR soon!

TD

On Mon, Nov 7, 2016 at 5:30 PM, Chan Chor Pang wrote:

> hi everyone
>
> it seems there are not many people interested in creating an API for
> Streaming. Nevertheless, I still really want the API for monitoring,
> so I tried to see if I can implement it on my own.
>
> After some testing, I believe I can achieve the goal by:
> 1. implementing a package (org.apache.spark.streaming.status.api.v1)
> that serves the same purpose as org.apache.spark.status.api.v1
> 2. registering the API path through StreamingTab
> 3. retrieving the streaming information through
> StreamingJobProgressListener
>
> My main concern now is whether my implementation will be able to be
> merged into the mainline.
>
> I'm new to open-source projects, so could anyone please shed some light
> on how I should/could proceed to get my implementation merged?
>
>
> here is my test code, based on v1.6.0
> ###
> diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
> new file mode 100644
> index 0000000..690e2d8
> --- /dev/null
> +++ b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
> @@ -0,0 +1,68 @@
> +package org.apache.spark.streaming.status.api.v1
> +
> +import java.io.OutputStream
> +import java.lang.annotation.Annotation
> +import java.lang.reflect.Type
> +import java.text.SimpleDateFormat
> +import java.util.{Calendar, SimpleTimeZone}
> +import javax.ws.rs.Produces
> +import javax.ws.rs.core.{MediaType, MultivaluedMap}
> +import javax.ws.rs.ext.{MessageBodyWriter, Provider}
> +
> +import com.fasterxml.jackson.annotation.JsonInclude
> +import com.fasterxml.jackson.databind.{ObjectMapper, SerializationFeature}
> +
> +@Provider
> +@Produces(Array(MediaType.APPLICATION_JSON))
> +private[v1] class JacksonMessageWriter extends MessageBodyWriter[Object]{
> +
> +  val mapper = new ObjectMapper() {
> +    override def writeValueAsString(t: Any): String = {
> +      super.writeValueAsString(t)
> +    }
> +  }
> +  mapper.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule)
> +  mapper.enable(SerializationFeature.INDENT_OUTPUT)
> +  mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL)
> +  mapper.setDateFormat(JacksonMessageWriter.makeISODateFormat)
> +
> +  override def isWriteable(
> +      aClass: Class[_],
> +      `type`: Type,
> +      annotations: Array[Annotation],
> +      mediaType: MediaType): Boolean = {
> +    true
> +  }
> +
> +  override def writeTo(
> +      t: Object,
> +      aClass: Class[_],
> +      `type`: Type,
> +      annotations: Array[Annotation],
> +      mediaType: MediaType,
> +      multivaluedMap: MultivaluedMap[String, AnyRef],
> +      outputStream: OutputStream): Unit = {
> +    t match {
> +      //case ErrorWrapper(err) => outputStream.write(err.getBytes("utf-8"))
> +      case _ => mapper.writeValue(outputStream, t)
> +    }
> +  }
> +
> +  override def getSize(
> +      t: Object,
> +      aClass: Class[_],
> +      `type`: Type,
> +      annotations: Array[Annotation],
> +      mediaType: MediaType): Long = {
> +    -1L
> +  }
> +}
> +
> +private[spark] object JacksonMessageWriter {
> +  def makeISODateFormat: SimpleDateFormat = {
> +    val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'GMT'")
> +    val cal = Calendar.getInstance(new SimpleTimeZone(0, "GMT"))
> +    iso8601.setCalendar(cal)
> +    iso8601
> +  }
> +}
> diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala
> new file mode 100644
> index 0000000..f4e43dd
> --- /dev/null
> +++ b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala
> @@ -0,0 +1,74 @@
> +package org.apache.spark.streaming.status.api.v1
> +
> +import org.apache.spark.status.api.v1.UIRoot
> +import org.eclipse.jetty.server.handler.ContextHandler
> +import org.eclipse.jetty.servlet.ServletContextHandler
> +import org.eclipse.jetty.servlet.ServletHolder
> +
> +import com.sun.jersey.spi.container.servlet.ServletContainer
> +
> +import javax.servlet.ServletContext
> +import javax.ws.rs.Path
> +import javax.ws.rs.Produces
> +import javax.ws.rs.core.Context
> +import org.apache.spark.streaming.ui.StreamingJobProgressListener
> +
> +
> +@Path("/v1")
> +private[v1] class StreamingApiRootResource 

Re: REST api for monitoring Spark Streaming

2016-11-07 Thread Chan Chor Pang

hi everyone

it seems there are not many people interested in creating an API for
Streaming. Nevertheless, I still really want the API for monitoring,
so I tried to see if I can implement it on my own.

After some testing, I believe I can achieve the goal by:
1. implementing a package (org.apache.spark.streaming.status.api.v1)
that serves the same purpose as org.apache.spark.status.api.v1
2. registering the API path through StreamingTab
3. retrieving the streaming information through
StreamingJobProgressListener

My main concern now is whether my implementation will be able to be
merged into the mainline.

I'm new to open-source projects, so could anyone please shed some light
on how I should/could proceed to get my implementation merged?


here is my test code, based on v1.6.0
###
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
new file mode 100644
index 0000000..690e2d8
--- /dev/null
+++ b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
@@ -0,0 +1,68 @@
+package org.apache.spark.streaming.status.api.v1
+
+import java.io.OutputStream
+import java.lang.annotation.Annotation
+import java.lang.reflect.Type
+import java.text.SimpleDateFormat
+import java.util.{Calendar, SimpleTimeZone}
+import javax.ws.rs.Produces
+import javax.ws.rs.core.{MediaType, MultivaluedMap}
+import javax.ws.rs.ext.{MessageBodyWriter, Provider}
+
+import com.fasterxml.jackson.annotation.JsonInclude
+import com.fasterxml.jackson.databind.{ObjectMapper, SerializationFeature}
+
+@Provider
+@Produces(Array(MediaType.APPLICATION_JSON))
+private[v1] class JacksonMessageWriter extends MessageBodyWriter[Object]{
+
+  val mapper = new ObjectMapper() {
+    override def writeValueAsString(t: Any): String = {
+      super.writeValueAsString(t)
+    }
+  }
+  mapper.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule)
+  mapper.enable(SerializationFeature.INDENT_OUTPUT)
+  mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL)
+  mapper.setDateFormat(JacksonMessageWriter.makeISODateFormat)
+
+  override def isWriteable(
+      aClass: Class[_],
+      `type`: Type,
+      annotations: Array[Annotation],
+      mediaType: MediaType): Boolean = {
+    true
+  }
+
+  override def writeTo(
+      t: Object,
+      aClass: Class[_],
+      `type`: Type,
+      annotations: Array[Annotation],
+      mediaType: MediaType,
+      multivaluedMap: MultivaluedMap[String, AnyRef],
+      outputStream: OutputStream): Unit = {
+    t match {
+      //case ErrorWrapper(err) => outputStream.write(err.getBytes("utf-8"))
+      case _ => mapper.writeValue(outputStream, t)
+    }
+  }
+
+  override def getSize(
+      t: Object,
+      aClass: Class[_],
+      `type`: Type,
+      annotations: Array[Annotation],
+      mediaType: MediaType): Long = {
+    -1L
+  }
+}
+
+private[spark] object JacksonMessageWriter {
+  def makeISODateFormat: SimpleDateFormat = {
+    val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'GMT'")
+    val cal = Calendar.getInstance(new SimpleTimeZone(0, "GMT"))
+    iso8601.setCalendar(cal)
+    iso8601
+  }
+}
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala
new file mode 100644
index 0000000..f4e43dd
--- /dev/null
+++ b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala
@@ -0,0 +1,74 @@
+package org.apache.spark.streaming.status.api.v1
+
+import org.apache.spark.status.api.v1.UIRoot
+import org.eclipse.jetty.server.handler.ContextHandler
+import org.eclipse.jetty.servlet.ServletContextHandler
+import org.eclipse.jetty.servlet.ServletHolder
+
+import com.sun.jersey.spi.container.servlet.ServletContainer
+
+import javax.servlet.ServletContext
+import javax.ws.rs.Path
+import javax.ws.rs.Produces
+import javax.ws.rs.core.Context
+import org.apache.spark.streaming.ui.StreamingJobProgressListener
+
+
+@Path("/v1")
+private[v1] class StreamingApiRootResource extends UIRootFromServletContext {
+
+  @Path("streaminginfo")
+  def getStreamingInfo(): StreamingInfoResource = {
+    new StreamingInfoResource(uiRoot, listener)
+  }
+
+}
+
+private[spark] object StreamingApiRootResource {
+
+  def getServletHandler(uiRoot: UIRoot, listener: StreamingJobProgressListener): ServletContextHandler = {
+
+    val jerseyContext = new ServletContextHandler(ServletContextHandler.NO_SESSIONS)
+    jerseyContext.setContextPath("/streamingapi")
+    val holder: ServletHolder = new ServletHolder(classOf[ServletContainer])
+    holder.setInitParameter("com.sun.jersey.config.property.resourceConfigClass",
+      "com.sun.jersey.api.core.PackagesResourceConfig")
+ 
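
Step 2 of the plan, registering the API path through StreamingTab, is not
shown in the truncated diff. A minimal sketch of what that wiring might look
like in the 1.6 codebase follows; SparkUI's attachHandler, StreamingTab's
attach() and its listener field exist in 1.6, but the exact call site is an
assumption.

  // streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingTab.scala
  // (sketch) attach the REST handler alongside the existing streaming tab
  def attach() {
    getSparkUI(ssc).attachTab(this)
    getSparkUI(ssc).addStaticHandler(STATIC_RESOURCE_DIR, "/static/streaming")
    // serve JSON under /streamingapi/v1/streaminginfo on the driver UI;
    // SparkUI implements UIRoot, so it can be passed straight through
    getSparkUI(ssc).attachHandler(
      StreamingApiRootResource.getServletHandler(getSparkUI(ssc), listener))
  }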

Re: Handling questions in the mailing lists

2016-11-07 Thread Maciej Szymkiewicz
Just a couple of random thoughts regarding Stack Overflow...

  * If we are thinking about shifting focus towards SO, all attempts at
    micromanaging should be discarded right at the beginning. Especially
    things like meta tags, which are discouraged and "burninated"
    (https://meta.stackoverflow.com/tags/burninate-request/info), or
    thread bumping. Depending on the context, these would be unmanageable,
    against community guidelines, or simply obsolete.
  * Lack of expertise is unlikely to be an issue. Even now there are a
    number of advanced Spark users on SO. Of course, the more the merrier.

Things that can be easily improved:

  * Identifying, improving and promoting canonical questions and
    answers. That means closing duplicates, suggesting edits to improve
    existing answers, and providing alternative solutions. This can also
    be used to identify gaps in the documentation.
  * Providing a set of clear posting guidelines to reduce the effort
    required to identify the problem (think of
    http://stackoverflow.com/q/5963269, a.k.a. "How to make a great R
    reproducible example?").
  * Helping users decide if a question is a good fit for SO (see below).
    API questions are a great fit; debugging problems like "my cluster is
    slow" are not.
  * Actively cleaning up (closing, deleting) off-topic and low-quality
    questions. The less junk there is to sieve through, the better the
    chance of good questions being answered.
  * Repurposing and actively moderating SO docs
    (https://stackoverflow.com/documentation/apache-spark/topics). Right
    now most of the stuff that goes there is useless, duplicated,
    plagiarized, or borderline spam.
  * Encouraging the community to monitor featured
    (https://stackoverflow.com/questions/tagged/apache-spark?sort=featured)
    and active & upvoted & unanswered
    (https://stackoverflow.com/unanswered/tagged/apache-spark) questions.
  * Implementing some procedure to identify questions which are likely
    to be bugs or material for feature requests. Personally, I am quite
    often tempted to simply send a link to the dev list, but I don't
    think that is really acceptable.
  * Animating the Spark-related chat room. I tried this a couple of
    times, but to no avail. Without a certain critical mass of users it
    just won't work.



On 11/07/2016 07:32 AM, Reynold Xin wrote:
> This is an excellent point. If we do go ahead and feature SO as a way
> for users to ask questions more prominently, as someone who knows SO
> very well, would you be willing to help write a short guideline
> (ideally the shorter the better, which makes it hard) to direct what
> goes to user@ and what goes to SO?

Sure, I'll be happy to help if I can.

>
>
> On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz wrote:
>
> Damn, I always thought that mailing list is only for nice and
> welcoming people and there is nothing to do for me here >:)
>
> To be serious though, there are many questions on the users list
> which would fit just fine on SO, but that is not true in general.
> There are dozens of questions which are too broad, opinion based,
> ask for external resources, and so on. If you want to direct users
> to SO, you have to help them decide if it is the right channel.
> Otherwise it will just create a really bad experience both for those
> seeking help and for active answerers. The former will be downvoted
> and bashed; the latter will have to deal with all the junk, and the
> number of active Spark users with moderation privileges is really
> low (with only Massg and me being able to directly close duplicates).
>
> Believe me, I've seen this before.
>
> On 11/07/2016 05:08 AM, Reynold Xin wrote:
>> You have substantially underestimated how opinionated people can
>> be on mailing lists too :)
>>
>> On Sunday, November 6, 2016, Maciej Szymkiewicz wrote:
>>
>> You have to remember that the Stack Overflow crowd (like me) is
>> highly opinionated, so many questions which could be just fine on
>> the mailing list will be quickly downvoted and / or closed as
>> off-topic. Just saying...
>>
>> -- 
>> Best, 
>> Maciej
>>
>>
>> On 11/07/2016 04:03 AM, Reynold Xin wrote:
>>> OK I've checked on the ASF member list (which is private so
>>> there is no public archive).
>>>
>>> It is not against any ASF rule to recommend StackOverflow as
>>> a place for users to ask questions. I don't think we can or
>>> should delete the existing user@spark list either, but we
>>> can certainly make SO more visible than it is.
>>>
>>>
>>>
>>> On Wed, Nov 2, 2016 at 10:21 AM, Reynold Xin wrote:
>>>
>>> Actually after talking with more ASF members, I believe
>>> 

Re: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
It turned out suggested edits (trackable) don't show up for non-owners, so
I've just merged all the edits in place. It should be visible now.

On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin  wrote:

> Oops. Let me try to figure that out.
>
>
> On Monday, November 7, 2016, Cody Koeninger  wrote:
>
>> Thanks for picking up on this.
>>
>> Maybe I fail at google docs, but I can't see any edits on the document
>> you linked.
>>
>> Regarding lazy consensus, if the board in general has less of an issue
>> with that, sure.  As long as it is clearly announced, lasts at least
>> 72 hours, and has a clear outcome.
>>
>> The other points are hard to comment on without being able to see the
>> text in question.
>>
>>
>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin  wrote:
>> > I just looked through the entire thread again tonight - there are a
>> > lot of great ideas being discussed. Thanks Cody for taking the first
>> > crack at the proposal.
>> >
>> > I want to first comment on the context. Spark is one of the most
>> > innovative and important projects in (big) data -- overall technical
>> > decisions made in Apache Spark are sound. But of course, a project as
>> > large and active as Spark always has room for improvement, and we as a
>> > community should strive to take it to the next level.
>> >
>> > To that end, the two biggest areas for improvement in my opinion are:
>> >
>> > 1. Visibility: There is so much happening that it is difficult to know
>> > what really is going on. For people that don't follow closely, it is
>> > difficult to know what the important initiatives are. Even for people
>> > that do follow, it is difficult to know what specific things require
>> > their attention, since the number of pull requests and JIRA tickets is
>> > high and it's difficult to extract signal from noise.
>> >
>> > 2. Soliciting user (broadly defined, including developers themselves)
>> > input more proactively: At the end of the day the project provides
>> > value because users use it. Users can't tell us exactly what to build,
>> > but it is important to get their input.
>> >
>> >
>> > I've taken Cody's doc and edited it:
>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>> > (I've made all my modifications trackable)
>> >
>> > There are a couple of high-level changes I made:
>> >
>> > 1. I've consulted a board member and he recommended lazy consensus as
>> > opposed to voting, the reason being that in voting there can easily be
>> > a "loser" that gets outvoted.
>> >
>> > 2. I made it lighter weight, and renamed "strategy" to "optional
>> > design sketch". Echoing one of the earlier emails: "IMHO so far aside
>> > from tagging things and linking them elsewhere simply having design
>> > docs and prototypes implementations in PRs is not something that has
>> > not worked so far".
>> >
>> > 3. I made some language tweaks to focus more on visibility. For
>> > example, "The purpose of an SIP is to inform and involve", rather than
>> > just "involve". SIPs should also have at least two emails that go to
>> > dev@.
>> >
>> >
>> > While I was editing this, I thought we really needed a suggested
>> > template for design docs too. I will get to that too ...
>> >
>> >
>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin wrote:
>> >>
>> >> Most things looked OK to me too, although I do plan to take a closer
>> >> look after Nov 1st when we cut the release branch for 2.1.
>> >>
>> >>
>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin wrote:
>> >>>
>> >>> The proposal looks OK to me. I assume, even though it's not
>> >>> explicitly called out, that voting would happen by e-mail? A template
>> >>> for the proposal document (instead of just a bullet list) would also
>> >>> be nice, but that can be done at any time.
>> >>>
>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>> >>> for a SIP, given the scope of the work. The document attached even
>> >>> somewhat matches the proposed format. So if anyone wants to try out
>> >>> the process...
>> >>>
>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger wrote:
>> >>> > Now that Spark Summit Europe is over, are any committers interested
>> >>> > in moving forward with this?
>> >>> >
>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >
>> >>> > Or are we going to let this discussion die on the vine?
>> >>> >
>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda wrote:
>> >>> >> Maybe my mail was not clear enough.
>> >>> >>
>> >>> >>
>> >>> >> I didn't want to write "let's focus on Flink" or any other
>> >>> >> framework. The idea with benchmarks was to show two things:
>> >>> >>
>> >>> >> - why some people are 

Re: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
Oops. Let me try to figure that out.

On Monday, November 7, 2016, Cody Koeninger  wrote:

> Thanks for picking up on this.
>
> Maybe I fail at google docs, but I can't see any edits on the document
> you linked.
>
> Regarding lazy consensus, if the board in general has less of an issue
> with that, sure.  As long as it is clearly announced, lasts at least
> 72 hours, and has a clear outcome.
>
> The other points are hard to comment on without being able to see the
> text in question.
>
>
> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin wrote:
> > I just looked through the entire thread again tonight - there are a lot
> > of great ideas being discussed. Thanks Cody for taking the first crack
> > at the proposal.
> >
> > I want to first comment on the context. Spark is one of the most
> > innovative and important projects in (big) data -- overall technical
> > decisions made in Apache Spark are sound. But of course, a project as
> > large and active as Spark always has room for improvement, and we as a
> > community should strive to take it to the next level.
> >
> > To that end, the two biggest areas for improvement in my opinion are:
> >
> > 1. Visibility: There is so much happening that it is difficult to know
> > what really is going on. For people that don't follow closely, it is
> > difficult to know what the important initiatives are. Even for people
> > that do follow, it is difficult to know what specific things require
> > their attention, since the number of pull requests and JIRA tickets is
> > high and it's difficult to extract signal from noise.
> >
> > 2. Soliciting user (broadly defined, including developers themselves)
> > input more proactively: At the end of the day the project provides
> > value because users use it. Users can't tell us exactly what to build,
> > but it is important to get their input.
> >
> >
> > I've taken Cody's doc and edited it:
> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> > (I've made all my modifications trackable)
> >
> > There are a couple of high-level changes I made:
> >
> > 1. I've consulted a board member and he recommended lazy consensus as
> > opposed to voting, the reason being that in voting there can easily be
> > a "loser" that gets outvoted.
> >
> > 2. I made it lighter weight, and renamed "strategy" to "optional design
> > sketch". Echoing one of the earlier emails: "IMHO so far aside from
> > tagging things and linking them elsewhere simply having design docs and
> > prototypes implementations in PRs is not something that has not worked
> > so far".
> >
> > 3. I made some language tweaks to focus more on visibility. For
> > example, "The purpose of an SIP is to inform and involve", rather than
> > just "involve". SIPs should also have at least two emails that go to
> > dev@.
> >
> >
> > While I was editing this, I thought we really needed a suggested
> > template for design docs too. I will get to that too ...
> >
> >
> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin wrote:
> >>
> >> Most things looked OK to me too, although I do plan to take a closer
> >> look after Nov 1st when we cut the release branch for 2.1.
> >>
> >>
> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin wrote:
> >>>
> >>> The proposal looks OK to me. I assume, even though it's not
> >>> explicitly called out, that voting would happen by e-mail? A template
> >>> for the proposal document (instead of just a bullet list) would also
> >>> be nice, but that can be done at any time.
> >>>
> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
> >>> for a SIP, given the scope of the work. The document attached even
> >>> somewhat matches the proposed format. So if anyone wants to try out
> >>> the process...
> >>>
> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger wrote:
> >>> > Now that Spark Summit Europe is over, are any committers interested
> >>> > in moving forward with this?
> >>> >
> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >>> >
> >>> > Or are we going to let this discussion die on the vine?
> >>> >
> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda wrote:
> >>> >> Maybe my mail was not clear enough.
> >>> >>
> >>> >>
> >>> >> I didn't want to write "let's focus on Flink" or any other
> >>> >> framework. The idea with benchmarks was to show two things:
> >>> >>
> >>> >> - why some people are doing bad PR for Spark
> >>> >>
> >>> >> - how - in an easy way - we can change it and show that Spark is
> >>> >> still on the top
> >>> >>
> >>> >>
> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
> >>> >> they're the most important 

Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Cody Koeninger
Thanks for picking up on this.

Maybe I fail at google docs, but I can't see any edits on the document
you linked.

Regarding lazy consensus, if the board in general has less of an issue
with that, sure.  As long as it is clearly announced, lasts at least
72 hours, and has a clear outcome.

The other points are hard to comment on without being able to see the
text in question.


On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin  wrote:
> I just looked through the entire thread again tonight - there are a lot of
> great ideas being discussed. Thanks Cody for taking the first crack at the
> proposal.
>
> I want to first comment on the context. Spark is one of the most innovative
> and important projects in (big) data -- overall technical decisions made in
> Apache Spark are sound. But of course, a project as large and active as
> Spark always has room for improvement, and we as a community should strive
> to take it to the next level.
>
> To that end, the two biggest areas for improvement in my opinion are:
>
> 1. Visibility: There is so much happening that it is difficult to know what
> really is going on. For people that don't follow closely, it is difficult to
> know what the important initiatives are. Even for people that do follow, it
> is difficult to know what specific things require their attention, since the
> number of pull requests and JIRA tickets is high and it's difficult to
> extract signal from noise.
>
> 2. Soliciting user (broadly defined, including developers themselves) input
> more proactively: At the end of the day the project provides value because
> users use it. Users can't tell us exactly what to build, but it is important
> to get their input.
>
>
> I've taken Cody's doc and edited it:
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> (I've made all my modifications trackable)
>
> There are a couple of high-level changes I made:
>
> 1. I've consulted a board member and he recommended lazy consensus as
> opposed to voting, the reason being that in voting there can easily be a
> "loser" that gets outvoted.
>
> 2. I made it lighter weight, and renamed "strategy" to "optional design
> sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging
> things and linking them elsewhere simply having design docs and prototypes
> implementations in PRs is not something that has not worked so far".
>
> 3. I made some language tweaks to focus more on visibility. For example,
> "The purpose of an SIP is to inform and involve", rather than just
> "involve". SIPs should also have at least two emails that go to dev@.
>
>
> While I was editing this, I thought we really needed a suggested template
> for design docs too. I will get to that too ...
>
>
> On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin wrote:
>>
>> Most things looked OK to me too, although I do plan to take a closer look
>> after Nov 1st when we cut the release branch for 2.1.
>>
>>
>> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin wrote:
>>>
>>> The proposal looks OK to me. I assume, even though it's not explicitly
>>> called out, that voting would happen by e-mail? A template for the
>>> proposal document (instead of just a bullet list) would also be nice,
>>> but that can be done at any time.
>>>
>>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>>> for a SIP, given the scope of the work. The document attached even
>>> somewhat matches the proposed format. So if anyone wants to try out
>>> the process...
>>>
>>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger wrote:
>>> > Now that Spark Summit Europe is over, are any committers interested in
>>> > moving forward with this?
>>> >
>>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >
>>> > Or are we going to let this discussion die on the vine?
>>> >
>>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda wrote:
>>> >> Maybe my mail was not clear enough.
>>> >>
>>> >>
>>> >> I didn't want to write "let's focus on Flink" or any other framework.
>>> >> The idea with benchmarks was to show two things:
>>> >>
>>> >> - why some people are doing bad PR for Spark
>>> >>
>>> >> - how - in an easy way - we can change it and show that Spark is
>>> >> still on the top
>>> >>
>>> >>
>>> >> No more, no less. Benchmarks will be helpful, but I don't think
>>> >> they're the most important thing in Spark :) On the Spark main page
>>> >> there is still the chart "Spark vs Hadoop". It is important to show
>>> >> that the framework is not the same Spark with another API, but much
>>> >> faster and optimized, comparable to or even faster than other
>>> >> frameworks.
>>> >>
>>> >>
>>> >> About real-time streaming, I think it would be just good to see it in
>>> >> Spark. I very much like the current Spark model, but many voices that
>>> >> say 

Re: Handling questions in the mailing lists

2016-11-07 Thread Ricardo Almeida
Thanks Reynold for reviewing the ASF rules.
Despite the potential issues mentioned, I feel using Stack Overflow would be
an improvement. And yes, some guidelines/instructions have the potential to
improve the questions and the "escalation" process.

On 7 November 2016 at 10:48, Ioannis.Deligiannis wrote:

> My two cents (As a user/consumer)…
>
>
>
> I have been following & using Spark in financial services since before
> version 1 and before it migrated questions from Google Groups to the Apache
> mailing lists (which was a shame ☹).
>
>
>
> SO:
>
> There has been some momentum lately on SO, but as questions were not
> “monitored/answered” by Spark experts, the motivation of posting a question
> was low and in turn the quality of questions as well. As most of us know,
> SO is usually the first place to look for info and can greatly reduce the
> need to turn to user/dev groups so it would be great if there was more
> attention to it.
>
>
>
> Spark mailing lists:
>
> As the consensus appears to be, questions tend to get lost if not
> picked up within 1-2 days. Re-sending the same question feels “abusive” to
> me, so I would then give up. Given that a good question takes time, putting
> effort into a question that can easily be ignored results in mailing a
> “bad” question (see what happens?) or no question at all. As you have
> probably observed, a few users will mail a question to “dev” with “…no
> answers in user list…” as they incorrectly assume that no-one can answer
> their question.
>
>
>
> JIRA:
>
> I find that “issues” are being quite aggressively closed down. I’ve seen
> this twice (one I reported myself; I found the second ticket while looking
> for a solution), and for this reason it doesn’t encourage users to spend
> the time and effort to use it. Personally, I also feel that there is some
> bias on what is in-scope and out-of-scope.
>
>
>
> My preference would be that SO would be the first place someone posts a
> question. If a few “experts” are found regularly answering questions,
> eventually Spark users will start using it more and reduce the “user” list
> load by easily finding previous answers (or the SO community marking
> duplicates). The same “experts” can also encourage users to “escalate” to
> JIRA or the dev/user groups once a question has been properly filtered,
> which is quite common.
>
>
>
> PS. Personally, I would not follow any “bespoke/external” process on SO,
> e.g. down-voting on SO for any reason other than being a bad question as
> per SO rules.
>
>
>
>
>
> *From:* Matei Zaharia [mailto:matei.zaha...@gmail.com]
> *Sent:* 07 November 2016 07:45
> *To:* assaf.mendelson
> *Cc:* dev@spark.apache.org
>
> *Subject:* Re: Handling questions in the mailing lists
>
>
>
> Even for the mailing list, I'd love to have a short set of instructions on
> how to submit your questions (maybe on
> http://spark.apache.org/community.html or maybe in the welcome email when
> you subscribe). It would be great if someone added that. After all, we have
> such instructions for contributing PRs, for example.
>
>
>
> Matei
>
>
>
> On Nov 6, 2016, at 11:09 PM, assaf.mendelson wrote:
>
>
>
> There are other options as well. For example, hosting an AnswerHub
> (www.answerhub.com) or another similar separate Q&A service.
>
> BTW, I believe the main issue is not how opinionated people are but who is
> answering questions.
>
> Today there are already people asking (and getting answers) on SO
> (including myself). The problem is that many people do not go to SO.
>
> The problem I see is how to “bump” up questions which are not being
> answered to someone more likely to be able to answer them. Simple questions
> can be answered by many people, many of them even newbies who ran into the
> issue themselves.
>
> The main issue is that the more complex the question, the fewer people
> there are who can answer it, and those people’s bandwidth is already
> clogged by other questions.
>
> We could for example try to create tags on SO for “basic questions”,
> “medium”, “advanced”. Provide guidelines to ask first on basic, if not
> answered after X days then add the medium tag etc. Downvote people who
> don’t go by the process. This would mean that committers for example can
> look at advanced only tag and have a manageable number of questions they
> can help with while others can answer medium and basic.
>
>
>
> I agree that some things are not good for SO. Basically stuff which asks
> for opinion is such but most cases in 

RE: Handling questions in the mailing lists

2016-11-07 Thread Ioannis.Deligiannis
My two cents (As a user/consumer)…

I have been following & using Spark in financial services since before version 
1 and before it migrated questions from Google Groups to the Apache mailing 
lists (which was a shame ☹).

SO:
There has been some momentum lately on SO, but as questions were not 
“monitored/answered” by Spark experts, the motivation of posting a question was 
low and in turn the quality of questions as well. As most of us know, SO is 
usually the first place to look for info and can greatly reduce the need to 
turn to user/dev groups so it would be great if there was more attention to it.

Spark mailing lists:
As the consensus appears to be, questions tend to get lost if not picked up 
within 1-2 days. Re-sending the same question feels “abusive” to me, so I 
would then give up. Given that a good question takes time, putting effort 
into a question that can easily be ignored results in mailing a “bad” 
question (see what happens?) or no question at all. As you have probably 
observed, a few users will mail a question to “dev” with “…no answers in 
user list…” as they incorrectly assume that no-one can answer their question.

JIRA:
I find that “issues” are being quite aggressively closed down. I’ve seen this 
twice (one I reported myself; I found the second ticket while looking for a 
solution), and for this reason it doesn’t encourage users to spend the time 
and effort to use it. Personally, I also feel that there is some bias on what 
is in-scope and out-of-scope.

My preference would be that SO would be the first place someone posts a 
question. If a few “experts” are found regularly answering questions, 
eventually Spark users will start using it more and reduce the “user” list 
load by easily finding previous answers (or the SO community marking 
duplicates). The same “experts” can also encourage users to “escalate” to 
JIRA or the dev/user groups once a question has been properly filtered, which 
is quite common.

PS. Personally, I would not follow any “bespoke/external” process on SO, e.g. 
down-voting on SO for any reason other than being a bad question as per SO 
rules.


From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: 07 November 2016 07:45
To: assaf.mendelson
Cc: dev@spark.apache.org
Subject: Re: Handling questions in the mailing lists

Even for the mailing list, I'd love to have a short set of instructions on how 
to submit your questions (maybe on http://spark.apache.org/community.html or 
maybe in the welcome email when you subscribe). It would be great if someone 
added that. After all, we have such instructions for contributing PRs, for 
example.

Matei

On Nov 6, 2016, at 11:09 PM, assaf.mendelson wrote:

There are other options as well. For example, hosting an AnswerHub 
(www.answerhub.com) or another similar separate Q&A service.
BTW, I believe the main issue is not how opinionated people are but who is 
answering questions.
Today there are already people asking (and getting answers) on SO (including 
myself). The problem is that many people do not go to SO.
The problem I see is how to “bump” up questions which are not being answered to 
someone more likely to be able to answer them. Simple questions can be answered 
by many people, many of them even newbies who ran into the issue themselves.
The main issue is that the more complex the question, the fewer people there 
are who can answer it, and those people’s bandwidth is already clogged by 
other questions.
We could for example try to create tags on SO for “basic questions”, “medium”, 
“advanced”. Provide guidelines to ask first on basic, if not answered after X 
days then add the medium tag etc. Downvote people who don’t go by the process. 
This would mean that committers for example can look at advanced only tag and 
have a manageable number of questions they can help with while others can 
answer medium and basic.

I agree that some things are not good for SO. Basically, stuff which asks for 
an opinion is such, but most cases on the mailing list are either “how do I 
solve this bug” or “how do I do X”. Either of those two is a good fit for SO.


Assaf.



From: rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden email]]
Sent: Monday, November 07, 2016 8:33 AM
To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

This is an excellent point. If we do go ahead and feature SO as a way for users 
to ask questions more prominently, as someone who knows SO very well, would you 
be 

Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
I just looked through the entire thread again tonight - there are a lot of
great ideas being discussed. Thanks Cody for taking the first crack at the
proposal.

I want to first comment on the context. Spark is one of the most innovative
and important projects in (big) data -- overall technical decisions made in
Apache Spark are sound. But of course, a project as large and active as
Spark always has room for improvement, and we as a community should strive
to take it to the next level.

To that end, the two biggest areas for improvement in my opinion are:

1. Visibility: There is so much happening that it is difficult to know
what really is going on. For people that don't follow closely, it is
difficult to know what the important initiatives are. Even for people that
do follow, it is difficult to know what specific things require their
attention, since the number of pull requests and JIRA tickets is high and
it's difficult to extract signal from noise.

2. Soliciting user (broadly defined, including developers themselves) input
more proactively: At the end of the day the project provides value because
users use it. Users can't tell us exactly what to build, but it is
important to get their input.


I've taken Cody's doc and edited it:
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
(I've made all my modifications trackable)

There are a couple of high-level changes I made:

1. I've consulted a board member and he recommended lazy consensus as
opposed to voting, the reason being that in voting there can easily be a
"loser" that gets outvoted.

2. I made it lighter weight, and renamed "strategy" to "optional design
sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging
things and linking them elsewhere simply having design docs and prototypes
implementations in PRs is not something that has not worked so far".

3. I made some language tweaks to focus more on visibility. For example,
"The purpose of an SIP is to inform and involve", rather than just
"involve". SIPs should also have at least two emails that go to dev@.


While I was editing this, I thought we really needed a suggested template
for design docs too. I will get to that too ...


On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin wrote:

> Most things looked OK to me too, although I do plan to take a closer look
> after Nov 1st when we cut the release branch for 2.1.
>
>
> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin wrote:
>
>> The proposal looks OK to me. I assume, even though it's not explicitly
>> called out, that voting would happen by e-mail? A template for the
>> proposal document (instead of just a bullet list) would also be nice,
>> but that can be done at any time.
>>
>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate
>> for a SIP, given the scope of the work. The document attached even
>> somewhat matches the proposed format. So if anyone wants to try out
>> the process...
>>
>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger wrote:
>> > Now that Spark Summit Europe is over, are any committers interested in
>> > moving forward with this?
>> >
>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >
>> > Or are we going to let this discussion die on the vine?
>> >
>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda wrote:
>> >> Maybe my mail was not clear enough.
>> >>
>> >>
>> >> I didn't want to write "let's focus on Flink" or any other framework.
>> >> The idea with benchmarks was to show two things:
>> >>
>> >> - why some people are doing bad PR for Spark
>> >>
>> >> - how - in an easy way - we can change it and show that Spark is
>> >> still on the top
>> >>
>> >>
>> >> No more, no less. Benchmarks will be helpful, but I don't think
>> >> they're the most important thing in Spark :) On the Spark main page
>> >> there is still the chart "Spark vs Hadoop". It is important to show
>> >> that the framework is not the same Spark with another API, but much
>> >> faster and optimized, comparable to or even faster than other
>> >> frameworks.
>> >>
>> >>
>> >> About real-time streaming, I think it would be just good to see it in
>> >> Spark. I very much like the current Spark model, but many voices say
>> >> "we need more" - the community should also listen to them and try to
>> >> help them. With SIPs it would be easier; I've just posted this
>> >> example as a "thing that may be changed with a SIP".
>> >>
>> >>
>> >> I very much like unification via Datasets, but there are a lot of
>> >> algorithms inside - let's make an easy API, but with a strong
>> >> background (articles, benchmarks, descriptions, etc.) that shows that
>> >> Spark is still a modern framework.
>> >>
>> >>
>> >> Maybe now my intention will be clearer :) As I said, organizational
>> >> ideas were