Re: Handling questions in the mailing lists

2016-11-09 Thread Maciej Szymkiewicz
If you take a look at the statistics
(https://data.stackexchange.com/stackoverflow/query/575406) you'll see
that the majority of the unanswered questions:

  * have seen no activity in the last year, OR
  * don't have a positive score, OR
  * have been asked by inactive or new users.

This is usually a good indicator that a question is of poor quality
and/or abandoned and has, for one reason or another, not been picked up
by the removal process (https://stackoverflow.com/help/roomba). This is
not unusual for Stack Overflow, and with a little organized effort it
could be cleaned up in a few weeks.
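
For anyone who wants to reproduce such a list without SEDE, a rough
equivalent can be pulled from the public Stack Exchange API; a minimal
sketch follows. The endpoint and response fields are real API features,
but the concrete thresholds (one year of inactivity, reputation under
100 as a proxy for a new or inactive owner) are illustrative
assumptions, not the criteria of the query linked above.

    import time

    import requests

    # Fetch unanswered apache-spark questions, oldest activity first.
    resp = requests.get(
        "https://api.stackexchange.com/2.3/questions/no-answers",
        params={"site": "stackoverflow", "tagged": "apache-spark",
                "sort": "activity", "order": "asc", "pagesize": 100},
    )
    resp.raise_for_status()

    one_year_ago = time.time() - 365 * 24 * 3600
    for q in resp.json()["items"]:
        stale = q["last_activity_date"] < one_year_ago   # no activity in a year
        no_positive_score = q["score"] <= 0              # zero or negative score
        new_or_inactive = q.get("owner", {}).get("reputation", 0) < 100
        if stale or no_positive_score or new_or_inactive:
            print(q["question_id"], q["title"])

Anything matching more than one of those conditions is a strong
candidate for the cleanup pass described above.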

Arguably, for a technology with a large number of moving parts, Spark
has a pretty decent /answer rate/, definitely better than many
comparable projects.

Regarding tagging: putting community rules aside, clean questions which
can be answered with relatively low effort are usually resolved in a few
days. What is left is either too time-consuming, too complex, or simply
not worth the time. Given enough time, the former can easily be selected
using predefined filters, and the rest usually qualifies for closing.

Still, I believe there is a really important point missing here. All of
this requires a lot of effort, and it is somewhat unrealistic to expect
that the number of people willing, and with the time, to contribute will
suddenly grow. So the focus should be on building a knowledge base which
can reduce the number of questions that need to be answered. SO has good
visibility, a large number of existing answers, and very good tools.
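
As one concrete illustration (mine, not from the thread): the kind of
canonical, self-contained reproducible example such a knowledge base
could promote might look like this minimal PySpark sketch, with inline
data instead of private files so anyone can run it:

    from pyspark.sql import SparkSession

    # Small local session so the example runs anywhere.
    spark = SparkSession.builder.master("local[2]").appName("repro").getOrCreate()

    # Inline data instead of a private file, so others can reproduce it.
    df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "value"])

    # The smallest operation that demonstrates the behavior being asked about.
    df.groupBy("value").count().show()

    spark.stop()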

On 11/09/2016 08:02 AM, assaf.mendelson wrote:
>
> I like the document and I think it is good but I still feel like we
> are missing an important part here.
>
>  
>
> Look at SO today. There are:
>
> - 4658 unanswered questions under the apache-spark tag.
>
> - 394 unanswered questions under the spark-dataframe tag.
>
> - 639 unanswered questions under the apache-spark-sql tag.
>
> - 859 unanswered questions under the pyspark tag.
>
>  
>
> Just moving people to ask there will not help. The whole issue is
> having people answer the questions.
>
>  
>
> The problem is that many of these questions do not fit SO (but are
> already there, so they are noise), are bad (i.e. unclear or hard to
> answer), orphaned, etc., while some are simply harder than what people
> with some experience in Spark can handle and require more expertise.
>
> The problem is that people with the relevant expertise are drowning in
> noise. This is true for the mailing list, and this is true for SO.
>
>  
>
> For this reason I believe that just moving people to SO will not solve
> anything.
>
>  
>
> My original thought was that if we had different tags, then different
> people could watch open questions on those tags and therefore see much
> less noise. I thought that we would have a low tier (the current one)
> of people simply not following the documentation (which would remain
> noise), then a beginner tier where we could have people downvoting bad
> questions, but where in most cases the community can answer them
> because they are common, then a “medium” tier of harder questions that
> can still be answered by advanced users, and lastly an “advanced” tier
> which committers can actually subscribe to (and adding sub-tags for
> subsystems would improve this even more).
>
>  
>
> I was not aware of the SO policy for meta tags (the burnination link is
> about removing tags completely, so I am not sure how it applies; I
> believe this link is more relevant:
> https://stackoverflow.blog/2010/08/the-death-of-meta-tags/).
>
> There was actually a discussion along these lines on SO
> (http://meta.stackoverflow.com/questions/253338/filtering-questions-by-difficulty-level).
>
>  
>
> The fact that SO did not solve this issue does not mean we shouldn’t
> try.
>
>  
>
> The way I see it, some tags can easily be used even with the meta-tag
> limitation. For example, a spark-internal-development tag could be used
> for questions about the development of Spark itself. There are already
> tags for some Spark subsystems (an apache-spark-sql tag, a pyspark tag,
> a spark-streaming tag, etc.). The main issue I see, and the one we
> can’t seem to get around, is dividing between simple questions that the
> community should answer and hard questions which only advanced users
> can answer.
>
>  
>
> Maybe SO isn’t the correct platform for that, but even within it we can
> try to find a non-meta name for Spark beginner questions vs. Spark
> advanced questions.
>
> Assaf.
>
>  
>
>  
>
> *From:*Denny Lee [via Apache Spark Developers List]
> [mailto:ml-node+[hidden email]
> ]
> *Sent:* Tuesday, November 08, 2016 7:53 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Handling questions in the mailing lists
>
>  
>
> To help track and get the verbiage for the Spark community page and
> welcome email jump started, here's a working document for us to work
> with: 
> 

Contributing to Spark in GSoC 2017

2016-11-09 Thread Krishna Kalyan
Hello,
I am Krishna, currently a 2nd-year Masters student (MSc in Data Mining)
in Barcelona, studying at the Universitat Politècnica de Catalunya.
I know it's a little early for GSoC; however, I wanted to get a head start
working with the Spark community.
Is there anyone who would be mentoring GSoC 2017?
Could anyone please guide me on how to go about it?

Related Experience:
My Masters is mostly focused on data mining and machine learning
techniques. Before my Masters, I was a data engineer with IBM (India). I
was responsible for managing a 50-node Hadoop cluster for more than a year.
Most of my time was spent optimising and writing ETL (Apache Pig) jobs. Our
daily batch job aggregated more than 30 GB of CDRs and weblogs in our cluster.

I am most comfortable with Python and R (not a Scala expert, but I am sure
I can pick it up quickly).

My CV can be viewed via the link below.
(https://github.com/krishnakalyan3/Resume/raw/master/Resume.pdf)

My Spark Pull Requests
(
https://github.com/apache/spark/pulls?utf8=%E2%9C%93=is%3Apr%20author%3Akrishnakalyan3%20
)

Thank you so much,
Krishna


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Pratik Sharma
+1 (non-binding)

> On Nov 9, 2016, at 3:15 PM, Ryan Blue  wrote:
> 
> +1
> 
>> On Wed, Nov 9, 2016 at 1:14 PM, Yin Huai  wrote:
>> +1
>> 
>>> On Wed, Nov 9, 2016 at 1:14 PM, Yin Huai  wrote:
>>> +!
>>> 
 On Wed, Nov 9, 2016 at 1:02 PM, Denny Lee  wrote:
 +1 (non binding)
 
 
 
> On Tue, Nov 8, 2016 at 10:14 PM vaquar khan  wrote:
> +1 (non binding)
> 
> On Tue, Nov 8, 2016 at 10:21 PM, Weiqing Yang  
> wrote:
>  +1 (non binding)
> 
> Environment: CentOS Linux release 7.0.1406 (Core) / openjdk version 
> "1.8.0_111" 
>  
> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver 
> -Dpyspark -Dsparkr -DskipTests clean package
> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver 
> -Dpyspark -Dsparkr test
>  
> 
> On Tue, Nov 8, 2016 at 7:38 PM, Liwei Lin  wrote:
> +1 (non-binding)
> 
> Cheers,
> Liwei
> 
> On Tue, Nov 8, 2016 at 9:50 PM, Ricardo Almeida 
>  wrote:
> +1 (non-binding)
> 
> over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3, 
> YARN, Hive
> 
> 
> On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier 
>  wrote:
> +1
> 
> On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
> 
> 
> The tag to be voted on is v2.0.2-rc3 
> (584354eaac02531c9584188b143367ba694b0c34)
> 
> This release candidate resolves 84 issues: 
> https://s.apache.org/spark-2.0.2-jira
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
> 
> 
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking 
> an existing Spark workload and running on this release candidate, then 
> reporting any regressions from 2.0.1.
> 
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already 
> present in 2.0.1, missing features, or bugs related to new features will 
> not necessarily block this release.
> 
> Q: What fix version should I use for patches merging into branch-2.0 from 
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC 
> (i.e. RC4) is cut, I will change the fix version of those patches to 
> 2.0.2.
> 
> -- 
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> 
> IT Architect / Lead Consultant 
> Greater Chicago
>>> 
>> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Ryan Blue
+1

On Wed, Nov 9, 2016 at 1:14 PM, Yin Huai  wrote:

> +1


-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Yin Huai
+1

On Wed, Nov 9, 2016 at 1:14 PM, Yin Huai  wrote:

> +!


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Yin Huai
+!

On Wed, Nov 9, 2016 at 1:02 PM, Denny Lee  wrote:

> +1 (non binding)
>
>
>
> On Tue, Nov 8, 2016 at 10:14 PM vaquar khan  wrote:
>
>> *+1 (non binding)*
>>
>> On Tue, Nov 8, 2016 at 10:21 PM, Weiqing Yang 
>> wrote:
>>
>>  +1 (non binding)
>>
>>
>> Environment: CentOS Linux release 7.0.1406 (Core) / openjdk version
>> "1.8.0_111"
>>
>>
>>
>> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>> -Dpyspark -Dsparkr -DskipTests clean package
>>
>> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>> -Dpyspark -Dsparkr test
>>
>>
>>
>> On Tue, Nov 8, 2016 at 7:38 PM, Liwei Lin  wrote:
>>
>> +1 (non-binding)
>>
>> Cheers,
>> Liwei
>>
>> On Tue, Nov 8, 2016 at 9:50 PM, Ricardo Almeida <
>> ricardo.alme...@actnowib.com> wrote:
>>
>> +1 (non-binding)
>>
>> over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3,
>> YARN, Hive
>>
>>
>> On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>> +1
>>
>> On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>> a majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367
>> ba694b0c34)
>>
>> This release candidate resolves 84 issues: https://s.apache.org/spark-2.
>> 0.2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783 <(224)%20436-0783>
>>
>> IT Architect / Lead Consultant
>> Greater Chicago
>>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Denny Lee
+1 (non binding)



On Tue, Nov 8, 2016 at 10:14 PM vaquar khan  wrote:

> *+1 (non binding)*


Re: Handling questions in the mailing lists

2016-11-09 Thread Denny Lee
Hear hear! :)  Completely agree with you - here are the latest updates
to the Proposed Community Mailing Lists / StackOverflow Changes document.
Keep them coming, though at this point I'd like to limit new verbiage to
keep the document from getting too long and hence not being read.
Modifications and suggestions are absolutely welcome - just asking that
we don't make it too much longer.  Thanks!


On Wed, Nov 9, 2016 at 5:36 AM Gerard Maas  wrote:

> Great discussion. Glad to see it happening, and lucky to have seen it
> on the mailing list given its high volume.

Re: Would "alter table add column" be supported in the future?

2016-11-09 Thread Herman van Hövell tot Westerflier
This is currently not on any roadmap I know of. You can open a JIRA
ticket for it if you want to.

On Wed, Nov 9, 2016 at 6:02 PM, 汪洋  wrote:

> Hi,
>
> I notice that the “alter table add column” command is banned in Spark 2.0.
>
> Any plans on supporting it in the future? (After all, it was supported in
> Spark 1.6.x.)
>
> Thanks.
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Connectors using new Kafka consumer API

2016-11-09 Thread Cody Koeninger
OK... in general it seems to me like effort would be better spent
trying to help upstream, as opposed to us making a fifth slightly
different interface to Kafka (we currently have the 0.8 receiver, 0.8
DStream, 0.10 DStream, and 0.10 structured stream).

On Tue, Nov 8, 2016 at 10:05 PM, Mark Grover  wrote:
> I think they are open to others helping, in fact, more than one person has
> worked on the JIRA so far. And, it's been crawling really slowly and that's
> preventing adoption of Spark's new connector in secure Kafka environments.
>
> On Tue, Nov 8, 2016 at 7:59 PM, Cody Koeninger  wrote:
>>
>> Have you asked the assignee on the Kafka jira whether they'd be
>> willing to accept help on it?
>>
>> On Tue, Nov 8, 2016 at 5:26 PM, Mark Grover  wrote:
>> > Hi all,
>> > We currently have a new direct stream connector, thanks to work by Cody
>> > and
>> > others on SPARK-12177.
>> >
>> > However, that can't be used in secure clusters that require Kerberos
>> > authentication. That's because Kafka currently doesn't support
>> > delegation
>> > tokens (KAFKA-1696). Unfortunately, very little work has been done on
>> > that
>> > JIRA, so, in my opinion, folks who want to use secure Kafka (using the
>> > norm
>> > - Kerberos) can't do so because Spark Streaming can't consume from it
>> > today.
>> >
>> > The right way is, of course, to get delegation tokens in Kafka, but
>> > honestly I don't know if that's happening in the near future. I am
>> > wondering if we should consider something to remedy this - for example,
>> > we could come up with a receiver-based connector based on the new Kafka
>> > consumer API that'd support Kerberos authentication. It won't require
>> > delegation tokens since there's only a very small number of executors
>> > talking to Kafka. Of course, anyone who cares about high throughput and
>> > other direct-connector benefits would have to use the direct connector.
>> > Another thing we could do is ship the keytab to the executors in the
>> > direct connector, so delegation tokens are not required, but that would
>> > be a pretty compromising solution, and I'd prefer not doing that.
>> >
>> > What do folks think? Would love to hear your thoughts, especially
>> > about the receiver.
>> >
>> > Thanks!
>> > Mark
>
>
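
To make the receiver idea above a bit more concrete: below is a sketch
of the client-side security settings such a connector would have to
pass to the new (0.10) consumer. The property names are standard Kafka
client settings; how they would be wired into a Spark receiver is
deliberately left open, and the broker address is a placeholder.

    # Standard Kafka 0.10 consumer properties for Kerberos (SASL/GSSAPI).
    # Wiring into a Spark receiver is hypothetical; broker is a placeholder.
    kafka_params = {
        "bootstrap.servers": "broker1.example.com:9092",
        "group.id": "spark-secure-receiver",
        "security.protocol": "SASL_PLAINTEXT",  # or SASL_SSL with TLS
        "sasl.kerberos.service.name": "kafka",
        "key.deserializer":
            "org.apache.kafka.common.serialization.StringDeserializer",
        "value.deserializer":
            "org.apache.kafka.common.serialization.StringDeserializer",
    }
    # The JVM consumer would additionally need a JAAS login section, e.g.
    # -Djava.security.auth.login.config=/path/to/kafka_client_jaas.conf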

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Would "alter table add column" be supported in the future?

2016-11-09 Thread 汪洋
Hi,

I notice that the “alter table add column” command is banned in Spark 2.0.

Any plans on supporting it in the future? (After all, it was supported in
Spark 1.6.x.)
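
(For reference, a sketch of the statement under discussion, using
Hive-style ADD COLUMNS syntax; the table name is hypothetical:)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # Per this thread: on Spark 2.0 the parser rejects this statement
    # ("operation not allowed"); on 1.6.x the equivalent HiveContext.sql()
    # call succeeded.
    spark.sql("ALTER TABLE events ADD COLUMNS (ingest_date STRING)")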

Thanks.
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Handling questions in the mailing lists

2016-11-09 Thread Gerard Maas
Great discussion. Glad to see it happening, and lucky to have seen it on
the mailing list given its high volume.

I had this same conversation with Patrick Wendell a few Spark Summits ago.
At the time, SO was not even listed as a resource, and the idea was to make
it the primary "go-to" place for questions.

Having contributed to both the list (in its early days) and SO, the biggest
hurdle IMO is how to deal with lazy people. These days, at SO, I spend more
time leaving comments than answering in an attempt to moderate the
requirement of "show some effort" and clarify unclear questions.

It's my impression that the mailing list is much more friendly with "plz
send me da code" folk and indeed would answer questions that would
otherwise get down-voted or closed at SO. That also shows in the high email
volume, which at the same time lowers its value for many of us who get
overwhelmed. It's hard to separate authentic efforts in getting started,
which deserve help and encouragement, vs. moderating "work dumpers" that
abuse resources to get their thing done. Also, beginner questions always
repeat, and a mailing list has no features to help with that.

The model I had imagined roughly follows the "Odersky scale":
 - Users new to the technology and basic "how to" questions belong on
Stack Overflow. => The search and de-duplication features should help in
getting an answer if already present, reducing the load.
 - Advanced discussions and troubleshooting belong in users@
 - Library bugs, new features and improvements belong in dev@

Of course, there's no hard line between these levels, and it would require
contributor discretion aided by some routing procedure:

- Spark documentation should establish Stack Overflow as the main go-to
resource.
- Contributors on the list should gently redirect "intro level questions"
to Stack Overflow.
- SO contributors should redirect potential bugs and questions deserving a
deeper discussion to @users or @dev as needed
- @users -> @dev as today
- Cross-posting SO + @users should be discouraged. The idea is to create
efficient channels.

A good resource on how and where to ask questions would be a great routing
channel between the levels above.
I'm willing to help with moderation efforts on "Spark Overflow" :-) to get
this going.

The Spark community has always been very welcoming and that spirit should
be preserved. We just need to channel the efforts in a more efficient way.

my 2c,

Gerard.


On Mon, Nov 7, 2016 at 11:24 PM, Maciej Szymkiewicz 
wrote:

> Just a couple of random thoughts regarding Stack Overflow...
>
>    - If we are thinking about shifting focus towards SO, all attempts at
>    micromanaging should be discarded right at the beginning, especially
>    things like meta tags, which are discouraged and "burninated"
>    (https://meta.stackoverflow.com/tags/burninate-request/info), or thread
>    bumping. Depending on the context, these are unmanageable, go against
>    community guidelines, or are simply obsolete.
>    - Lack of expertise is unlikely to be an issue. Even now there are a
>    number of advanced Spark users on SO. Of course, the more the merrier.
>
> Things that can be easily improved:
>
>    - Identifying, improving and promoting canonical questions and answers.
>    This means closing duplicates, suggesting edits to improve existing
>    answers, and providing alternative solutions. It can also be used to
>    identify gaps in the documentation.
>    - Providing a set of clear posting guidelines to reduce the effort
>    required to identify the problem (think of
>    http://stackoverflow.com/q/5963269, a.k.a. How to make a great R
>    reproducible example?).
>    - Helping users decide if a question is a good fit for SO (see below).
>    API questions are a great fit; debugging problems like "my cluster is
>    slow" are not.
>    - Actively cleaning (closing, deleting) off-topic and low-quality
>    questions. The less junk to sieve through, the better the chance of
>    good questions being answered.
>    - Repurposing and actively moderating SO docs
>    (https://stackoverflow.com/documentation/apache-spark/topics). Right
>    now most of the stuff that goes there is useless, duplicated,
>    plagiarized, or borderline spam.
>    - Encouraging the community to monitor featured
>    (https://stackoverflow.com/questions/tagged/apache-spark?sort=featured)
>    and active & upvoted & unanswered
>    (https://stackoverflow.com/unanswered/tagged/apache-spark) questions.
>    - Implementing some procedure to identify questions which are likely
>    to be bugs or material for feature requests. Personally, I am quite
>    often tempted to simply send a link to the dev list, but I don't think
>    it is really

RE: Handling questions in the mailing lists

2016-11-09 Thread assaf.mendelson
I was just wondering, before we move on to SO:
Do we have enough contributors with enough reputation to manage things on SO?
We would need contributors with enough reputation to have the relevant privileges.
For example: creating tags requires 1,500 reputation; editing questions and
answers, 2,000; creating tag synonyms, 2,500; approving tag wiki edits, 5,000;
access to moderator tools, 10,000 (this is required to delete questions, etc.);
and protecting questions, 15,000.
All of these are important if we plan to have SO as a main resource.
I know I originally suggested SO; however, if we do not have contributors with
the required privileges and the willingness to help manage everything, then I
am not sure this is a good fit.
Assaf.

From: Denny Lee [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19799...@n3.nabble.com]
Sent: Wednesday, November 09, 2016 9:54 AM
To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

Agreed that simply moving the questions to SO will not solve anything, but I
think the call-out about the meta tags is that we need to abide by SO rules;
if we were to just jump in and start creating meta tags, we would be violating
at minimum the spirit and at maximum the actual conventions around SO.

Saying this, perhaps we could suggest tags that we place in the header of the
question, whether on SO or the mailing lists, to help us sort through all of
these questions faster, just as you suggested.  The Proposed Community Mailing
Lists / StackOverflow Changes document has been updated to include suggested
tags.  WDYT?
