Evolutionary algorithm (EA) in Spark

2016-11-02 Thread Chris Lin
Hi All,

I would like to know if there is any plan to implement evolutionary
algorithms in Spark ML, such as particle swarm optimization, genetic
algorithms, ant colony optimization, etc.
If someone is working on this in Spark or has already done so, I
would like to contribute to it and get some guidance on how to go about it.
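
To make the question concrete, here is a minimal, hypothetical sketch (not an
existing Spark API) of how one generation of a genetic algorithm might be
parallelized over an RDD; the fitness function and parameters are just toy
examples:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object GeneticSketch {
  type Individual = Array[Double]

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ga-sketch").setMaster("local[*]"))
    val dim = 10
    val popSize = 100
    // Toy fitness: maximize the negative sum of squares (optimum at the origin).
    def fitness(ind: Individual): Double = -ind.map(x => x * x).sum

    var population: Seq[Individual] =
      Seq.fill(popSize)(Array.fill(dim)(Random.nextDouble() * 2 - 1))

    for (_ <- 1 to 20) {
      // Fitness evaluation is the embarrassingly parallel part: run it on the cluster.
      val scored = sc.parallelize(population).map(ind => (fitness(ind), ind)).collect()
      val parents = scored.sortBy(-_._1).take(popSize / 2).map(_._2)
      // Uniform crossover plus Gaussian mutation on the driver, for simplicity.
      population = Seq.fill(popSize) {
        val a = parents(Random.nextInt(parents.length))
        val b = parents(Random.nextInt(parents.length))
        a.zip(b).map { case (x, y) =>
          (if (Random.nextBoolean()) x else y) + Random.nextGaussian() * 0.01
        }
      }
    }
    println("best fitness after 20 generations: " + population.map(fitness).max)
    sc.stop()
  }
}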

Regards,
Chris Lin



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Evolutionary-algorithm-EA-in-Spark-tp19716.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




[VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-02 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.3
[ ] -1 Do not release this package because ...


The tag to be voted on is v1.6.3-rc2
(1e860747458d74a4ccbd081103a0542a2367b14b)

This release candidate addresses 52 JIRA tickets:
https://s.apache.org/spark-1.6.3-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1212/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/


===
== How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload, running it on this release candidate, and then
reporting any regressions from 1.6.2.
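
For example, an sbt-based workload can be pointed at the staging artifacts
with a build.sbt fragment like the following sketch (the resolver name is
arbitrary; the URL is the staging repository listed below):

resolvers += "Apache Spark 1.6.3 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1212/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3" % "provided"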


== What justifies a -1 vote for this release?

This is a maintenance release in the 1.6.x series.  Bugs already present in
1.6.2, missing features, or bugs related to new features will not
necessarily block this release.


Re: [VOTE] Release Apache Spark 1.6.3 (RC1)

2016-11-02 Thread Reynold Xin
This vote is cancelled and I'm sending out a new vote for rc2 now.


On Mon, Oct 17, 2016 at 5:18 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.3. The vote is open until Thursday, Oct 20, 2016 at 18:00 PDT and
> passes if a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.3
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.3-rc1 (7375bb0c825408ea010dcef31c0759cf94ffe5c2)
>
> This release candidate addresses 50 JIRA tickets: https://s.apache.org/spark-1.6.3-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1205/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc1-docs/
>
>
> ===
> == How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.2.
>
> 
> == What justifies a -1 vote for this release?
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present
> in 1.6.2, missing features, or bugs related to new features will not
> necessarily block this release.
>
>


Blocked PySpark changes

2016-11-02 Thread Holden Karau
Hi Spark Developers & Maintainers,

I know we've been talking a lot about what changes we want in
PySpark to help keep it interesting and usable (see
http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html).
One of the underlying challenges that we haven't explicitly discussed is
that a big reason behind the slow pace of a lot of the PySpark development
is the lack of dedicated Python reviewers.

For changes which are based around parity with an existing component,
Python contributors like myself can sometimes get reviewers from the
component (like ML) to take a look at our Python changes - but for core
changes it's even harder to get reviewers.

The general Python PR review dashboard shows a number of PRs
languishing - but to specifically call out a few:

   - pip installability - https://github.com/apache/spark/pull/15659

   - KMeans summary in Python - https://github.com/apache/spark/pull/13557

   - The various Anaconda/Virtualenv support PRs (none of them have had any
     luck with committer bandwidth)

   - PySpark ML models should have params - finally starting to get committer
     review, but blocked for months (https://github.com/apache/spark/pull/14653)

   - Python meta algorithms in Scala - https://github.com/apache/spark/pull/13794
     (out of sync with master, but waiting for months for a committer to say if
     they are interested in the feature or not)


For those following a lot of Python JIRAs, you have also probably noticed
many Python-related JIRAs being re-targeted to future versions that keep
getting pushed back.

The lack of core Python reviewers will make things like Arrow integration
difficult to achieve unless the situation changes.

This isn't meant to say that the current Python reviewers aren't good -
there just isn't enough Python committer bandwidth available to move these
things forward. The normal solution to this is adding more committers with
that focus area.

I'd love to hear y'alls thoughts on this.

Cheers,

Holden :)


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Question about using collaborative filtering in MLlib

2016-11-02 Thread Yuhao Yang
Hi Zak,

Indeed the function is missing in the DataFrame-based API. I can probably
put together a quick prototype to see if we can merge the function into the
next release. I will send an update here - feel free to give it a try.
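
In the meantime, the RDD-based API already exposes this. A minimal sketch
(it assumes an existing SparkContext `sc` and a hypothetical "ratings.csv"
file of user,product,rating lines):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

val ratings: RDD[Rating] = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(",")
  Rating(user.toInt, product.toInt, rating.toDouble)
}
val model = ALS.train(ratings, 10, 10, 0.01)  // rank, iterations, lambda

// Top 5 product recommendations per user, as RDD[(Int, Array[Rating])].
val recs = model.recommendProductsForUsers(5)
recs.take(3).foreach { case (user, recommendations) =>
  println(s"user $user -> " + recommendations.map(_.product).mkString(", "))
}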

Regards,
Yuhao

2016-11-01 10:00 GMT-07:00 Zak H :

> Hi,
>
> I'm using the Java API for the DataFrame API in Spark MLlib. Should I be
> using the RDD API instead? I'm not sure if this functionality has been
> ported over to DataFrames - correct me if I'm wrong.
>
> My goal is to evaluate spark's recommendation capabilities. I'm looking
> at this example:
>
> http://spark.apache.org/docs/latest/ml-collaborative-filtering.html
>
> Looking at the java docs I can see there is a method:
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.html
>
> "public RDD<scala.Tuple2<Object, Rating[]>> recommendUsersForProducts(int num)"
>
>
> For some reason the recommendProductsForUsers method isn't available in
> the java api:
> model.recommendProductsForUsers
>
> Is there something I'm missing here?
>
> I've posted my code here, in this gist. I am using the DataFrame API for
> MLlib. I know there may be work to port over functionality from RDDs.
>
> https://gist.github.com/zmhassan/6ccdda8b4ad86f9b1924477c65ed5d45
>
> Thanks,
> Zak
>


Re: Structured streaming aggregation - update mode

2016-11-02 Thread Michael Armbrust
Yeah, agreed. As mentioned here, it's near the top of my list.
I just opened SPARK-18234 to track it.

On Wed, Nov 2, 2016 at 3:24 PM, Cristian Opris 
wrote:

> Hi,
>
> I've been looking at planned JIRAs for this, but can't find anything. Is
> this something that may be added soon? It's not clear to me how
> aggregation can realistically be used in a production scenario
> without this.
>
> Thanks,
> Cristian
>


Structured streaming aggregation - update mode

2016-11-02 Thread Cristian Opris
Hi,

I've been looking at planned JIRAs for this, but can't find anything. Is
this something that may be added soon? It's not clear to me how
aggregation can realistically be used in a production scenario
without this.
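
For context, a minimal sketch of the kind of query this affects (the socket
source and console sink are just for illustration). With only Complete mode
available, the full aggregate state is re-emitted on every trigger; an Update
mode would emit only the changed rows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("agg-sketch").master("local[*]").getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")   // an "update" mode would only emit changed rows
  .format("console")
  .start()

query.awaitTermination()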

Thanks,
Cristian


Re: Updating Parquet dep to 1.9

2016-11-02 Thread Ryan Blue
The stats problem is on the write side. Parquet compares byte buffers (also
used for UTF8 strings) byte-wise, but gets it wrong by comparing the Java
byte values, which are signed. UTF8 ordering is the same as byte-wise
comparison, but only if the bytes are compared as unsigned values. So Parquet
ends up with the wrong min and max if there are characters where the sign
bit / msb is set. For ASCII, the results are identical, but other character
sets, like latin1, end up with accented characters out of order.
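
A small standalone illustration of the sign issue (not Parquet's actual
comparator): "é" encodes to 0xC3 0xA9 in UTF-8, and those bytes are negative
as signed Java bytes, so a signed comparison puts "é" before "z" (0x7A) even
though its code point is larger:

def compareBytes(x: Array[Byte], y: Array[Byte], unsigned: Boolean): Int = {
  val len = math.min(x.length, y.length)
  var i = 0
  while (i < len) {
    val (xi, yi) =
      if (unsigned) (x(i) & 0xff, y(i) & 0xff) else (x(i).toInt, y(i).toInt)
    if (xi != yi) return xi - yi
    i += 1
  }
  x.length - y.length
}

val a = "é".getBytes("UTF-8")
val b = "z".getBytes("UTF-8")
println(compareBytes(a, b, unsigned = false) < 0) // true: signed order says é < z (wrong min/max)
println(compareBytes(a, b, unsigned = true)  < 0) // false: unsigned order says é > z (correct)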

Parquet 1.9.0 suppresses the min and max values when the sort order that
produced them is incorrect to fix the correctness bug in applications like
SparkSQL. There is a property to override this if you know your data has
only ASCII characters, but by default min and max are not considered
reliable and are not used to eliminate row groups with predicate push-down.
Other types aren't affected and row group filters will still work.

1.9.0 also adds dictionary filtering to predicate push-down, which can be
used in many cases to skip row groups as well. This doesn't use the min and
max values so it will still work.

The issue for the stats ordering bug is PARQUET-686. Writes will be fixed
in 1.9.1, which I'd like to have out in the next couple of weeks.

My overall recommendation is to do the update to 1.9.0, which fixes the
logging problem, too.

rb

On Wed, Nov 2, 2016 at 8:31 AM, Michael Allman  wrote:

> Sounds great. Regarding the min/max stats issue, is that an issue with the
> way the files are written or read? What's the Parquet project issue for
> that bug? What's the 1.9.1 release timeline look like?
>
> I will aim to have a PR in by the end of the week. I feel strongly that
> either this or https://github.com/apache/spark/pull/15538 needs to make
> it into 2.1. The logging output issue is really bad. I would probably call
> it a blocker.
>
> Michael
>
>
> On Nov 1, 2016, at 1:22 PM, Ryan Blue  wrote:
>
> I can when I'm finished with a couple other issues if no one gets to it
> first.
>
> Michael, if you're interested in updating to 1.9.0 I'm happy to help
> review that PR.
>
> On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin  wrote:
>
>> Ryan want to submit a pull request?
>>
>>
>> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue 
>> wrote:
>>
>>> 1.9.0 includes some fixes intended specifically for Spark:
>>>
>>> * PARQUET-389: Evaluates push-down predicates for missing columns as
>>> though they are null. This is to address Spark's work-around that requires
>>> reading and merging file schemas, even for metastore tables.
>>> * PARQUET-654: Adds an option to disable record-level predicate
>>> push-down, but keep row group evaluation. This allows Spark to skip row
>>> groups based on stats and dictionaries, but implement its own vectorized
>>> record filtering.
>>>
>>> The Parquet community also evaluated performance to ensure no
>>> performance regressions from moving to the ByteBuffer read path.
>>>
>>> There is one concern about 1.9.0 that will be addressed in 1.9.1, which
>>> is that stats calculations were incorrectly using unsigned byte order for
>>> string comparison. This means that min/max stats can't be used if the data
>>> contains (or may contain) UTF8 characters with the msb set. 1.9.0 won't
>>> return the bad min/max values for correctness, but there is a property to
>>> override this behavior for data that doesn't use the affected code points.
>>>
>>> Upgrading to 1.9.0 depends on how the community wants to handle the sort
>>> order bug: whether correctness or performance should be the default.
>>>
>>> rb
>>>
>>> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen  wrote:
>>>
 Yes this came up from a different direction:
 https://issues.apache.org/jira/browse/SPARK-18140

 I think it's fine to pursue an upgrade to fix these several issues. The
 question is just how well it will play with other components, so bears some
 testing and evaluation of the changes from 1.8, but yes this would be good.

 On Mon, Oct 31, 2016 at 9:07 PM Michael Allman 
 wrote:

> Hi All,
>
> Is anyone working on updating Spark's Parquet library dep to 1.9? If
> not, I can at least get started on it and publish a PR.
>
> Cheers,
>
> Michael
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>


-- 
Ryan Blue
Software Engineer
Netflix


BiMap BroadCast Variable - Kryo Serialization Issue

2016-11-02 Thread Kalpana Jalawadi
Hi,

I am getting a NullPointerException due to a Kryo serialization issue while
trying to read a BiMap broadcast variable. Attached are the code snippets.
The pointers shared here didn't help - link1, link2.
The Spark version used is 1.6.x, but this was working with 1.3.x.

Any help in this regard is much appreciated.
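
For reference, one possible workaround (a sketch and an assumption on my
part, not taken from the attached snippets) is to make Kryo fall back to Java
serialization for Guava's HashBiMap so it is not rebuilt field-by-field
during broadcast deserialization:

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import com.google.common.collect.HashBiMap
import org.apache.spark.serializer.KryoRegistrator

class BiMapRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[HashBiMap[_, _]], new JavaSerializer())
  }
}

// Enabled via configuration, e.g.:
//   spark.serializer       org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator <fully.qualified.name.of.BiMapRegistrator>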

Exception:

com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
App > Serialization trace:
App > value (com.demo.BiMapWrapper)
App > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1238)
App > at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
App > at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
App > at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
App > at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
App > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
App > at
com.manthan.aaas.algo.associationmining.impl.Test.lambda$execute$6abf5fd0$1(Test.java:39)
App > at
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1015)
App > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
App > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
App > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
App > at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
App > at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
App > at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
App > at scala.collection.TraversableOnce$class.to
(TraversableOnce.scala:273)
App > at scala.collection.AbstractIterator.to(Iterator.scala:1157)
App > at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
App > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
App > at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
App > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
App > at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
App > at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
App > at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1978)
App > at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1978)
App > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
App > at org.apache.spark.scheduler.Task.run(Task.scala:89)
App > at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
App > at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
App > at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
App > at java.lang.Thread.run(Thread.java:745)
App > Caused by: com.esotericsoftware.kryo.KryoException:
java.lang.NullPointerException
App > Serialization trace:
App > value (com.demo.BiMapWrapper)
App > at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
App > at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
App > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
App > at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
App > at
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:217)
App > at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
App > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1231)
App > ... 29 more
App > Caused by: java.lang.NullPointerException
App > at com.google.common.collect.HashBiMap.seekByKey(HashBiMap.java:180)
App > at com.google.common.collect.HashBiMap.put(HashBiMap.java:230)
App > at com.google.common.collect.HashBiMap.put(HashBiMap.java:218)
App > at
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
App > at
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
App > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
App > at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
App > ... 35 more
App >
App > 16/11/02 18:39:01 dispatcher-event-loop-2 INFO TaskSetManager:
Starting task 17.1 in stage 1.0 (TID 19, ip-10-0-1-237.ec2.internal,
partition 17,PROCESS_LOCAL, 2076 bytes)
App > 16/11/02 18:39:01 task-result-getter-3 INFO TaskSetManager: Lost task
17.1 in stage 1.0 (TID 19) on executor ip-10-0-1-237.ec2.internal:
java.io.IOException (com.esotericsoftware.kryo.KryoException:
java.lang.NullPointerException
App > Serialization trace:
App > 

Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Prajwal Tuladhar
Some messages from Apache mailing lists (Spark and ZK) were being marked as
spam by Gmail. After manually unmarking them as spam a few times, it seems to
have worked for me.

On Wed, Nov 2, 2016 at 5:29 PM, Russell Spitzer 
wrote:

> I had one bounce message last week, but haven't seen anything else, I also
> do the skip inbox filter thing though.
>
> On Wed, Nov 2, 2016 at 10:16 AM Matei Zaharia 
> wrote:
>
>> It might be useful to ask Apache Infra whether they have any information
>> on these (e.g. what do their own spam metrics say, do they get any feedback
>> from Google, etc). Unfortunately mailing lists seem to be less and less
>> well supported by most email providers.
>>
>> Matei
>>
>> On Nov 2, 2016, at 6:48 AM, Pete Robbins  wrote:
>>
>> I have gmail filters to add labels and skip inbox for anything sent to
>> dev@spark user@spark etc but still get the occasional message marked as
>> spam
>>
>>
>> On Wed, 2 Nov 2016 at 08:18 Sean Owen  wrote:
>>
>> I couldn't figure out why I was missing a lot of dev@ announcements, and
>> have just realized hundreds of messages to dev@ over the past month or
>> so have been marked as spam for me by Gmail. I have no idea why but it's
>> usually messages from Michael and Reynold, but not all of them. I'll see
>> replies to the messages but not the original. Who knows. I can make a
>> filter. I just wanted to give a heads up in case anyone else has been
>> silently missing a lot of messages.
>>
>>
>>


-- 
--
Cheers,
Praj


Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Adam Roberts
I'm seeing the same failure, but manifesting as a stack overflow, on 
various operating systems and architectures (RHEL 7.1, CentOS 7.2, SUSE 12, 
Ubuntu 14.04 and 16.04 LTS).

Build and test options:
mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver 
-DskipTests clean package

mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver 
-Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test

-Xss2048k -Dspark.buffer.pageSize=1048576 -Xmx4g

Stacktrace (this is with IBM's latest SDK for Java 8):

  scala> org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 0.0 (TID 0, localhost): 
com.google.common.util.concurrent.ExecutionError: 
java.lang.StackOverflowError: operating system stack overflow
at 
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at 
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:849)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:188)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:833)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:830)
at 
org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:137)
... omitted the rest for brevity
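
For anyone else trying to reproduce outside the test suite, this is roughly
the shape of the failing snippet (an assumption on my part, reconstructed
from the variable names and types visible in the interpreter output), run
inside spark-shell:

import spark.implicits._

val keyValueGrouped = Seq((1, 2), (3, 4)).toDS().groupByKey(_._1)
val mapGroups = keyValueGrouped.mapGroups((k, vs) => (k, vs.map(_._2).sum))
val broadcasted = spark.sparkContext.broadcast(1)
val dataset = mapGroups.map { case (k, v) => k + v + broadcasted.value }
dataset.collect()   // the job aborts here with the error shown above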

It would also be useful to include this small but useful change, which looks 
to have only just missed the cut: https://github.com/apache/spark/pull/14409




From:   Reynold Xin 
To: Dongjoon Hyun 
Cc: "dev@spark.apache.org" 
Date:   02/11/2016 18:37
Subject:Re: [VOTE] Release Apache Spark 2.0.2 (RC2)



Looks like there is an issue with Maven (likely just the test itself 
though). We should look into it.


On Wed, Nov 2, 2016 at 11:32 AM, Dongjoon Hyun  
wrote:
Hi, Sean.

The same failure blocks me, too.

- SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** 
FAILED ***

I used `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver 
-Dsparkr` on CentOS 7 / OpenJDK1.8.0_111.

Dongjoon.

On 2016-11-02 10:44 (-0700), Sean Owen  wrote:
> Sigs, license, etc are OK. There are no Blockers for 2.0.2, though here 
are
> the 4 issues still open:
>
> SPARK-14387 Enable Hive-1.x ORC compatibility with
> spark.sql.hive.convertMetastoreOrc
> SPARK-17957 Calling outer join and na.fill(0) and then inner join will 
miss
> rows
> SPARK-17981 Incorrectly Set Nullability to False in FilterExec
> SPARK-18160 spark.files & spark.jars should not be passed to driver in 
yarn
> mode
>
> Running with Java 8, -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 on
> Ubuntu 16, I am seeing consistent failures in this test below. I think 
we
> very recently changed this so it could be legitimate. But does anyone 
else
> see something like this? I have seen other failures in this test due to 
OOM
> but my MAVEN_OPTS allows 6g of heap, which ought to be plenty.
>
>
> - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** 
FAILED
> ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
> /_/
>
>   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
>
>   scala>
>   scala> keyValueGrouped:
> org.apache.spark.sql.KeyValueGroupedDataset[Int,(Int, Int)] =
> org.apache.spark.sql.KeyValueGroupedDataset@70c30f72
>
>   scala> mapGroups: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int,
> _2: int]
>
>   scala> broadcasted: org.apache.spark.broadcast.Broadcast[Int] =
> Broadcast(0)
>
>   scala>
>   scala>
>   scala> dataset: org.apache.spark.sql.Dataset[Int] = [value: int]
>
>   scala> org.apache.spark.SparkException: Job aborted due to stage 
failure:
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 
in
> stage 0.0 (TID 0, localhost):
> com.google.common.util.concurrent.ExecutionError:
> java.lang.ClassCircularityError:
> 
io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
>   at 
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at 

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Reynold Xin
Looks like there is an issue with Maven (likely just the test itself
though). We should look into it.


On Wed, Nov 2, 2016 at 11:32 AM, Dongjoon Hyun  wrote:

> Hi, Sean.
>
> The same failure blocks me, too.
>
> - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset ***
> FAILED ***
>
> I used `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> -Dsparkr` on CentOS 7 / OpenJDK1.8.0_111.
>
> Dongjoon.
>
> On 2016-11-02 10:44 (-0700), Sean Owen  wrote:
> > Sigs, license, etc are OK. There are no Blockers for 2.0.2, though here
> are
> > the 4 issues still open:
> >
> > SPARK-14387 Enable Hive-1.x ORC compatibility with
> > spark.sql.hive.convertMetastoreOrc
> > SPARK-17957 Calling outer join and na.fill(0) and then inner join will
> miss
> > rows
> > SPARK-17981 Incorrectly Set Nullability to False in FilterExec
> > SPARK-18160 spark.files & spark.jars should not be passed to driver in
> yarn
> > mode
> >
> > Running with Java 8, -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 on
> > Ubuntu 16, I am seeing consistent failures in this test below. I think we
> > very recently changed this so it could be legitimate. But does anyone
> else
> > see something like this? I have seen other failures in this test due to
> OOM
> > but my MAVEN_OPTS allows 6g of heap, which ought to be plenty.
> >
> >
> > - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset ***
> FAILED
> > ***
> >   isContain was true Interpreter output contained 'Exception':
> >   Welcome to
> >   __
> >/ __/__  ___ _/ /__
> >   _\ \/ _ \/ _ `/ __/  '_/
> >  /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
> > /_/
> >
> >   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
> >   Type in expressions to have them evaluated.
> >   Type :help for more information.
> >
> >   scala>
> >   scala> keyValueGrouped:
> > org.apache.spark.sql.KeyValueGroupedDataset[Int,(Int, Int)] =
> > org.apache.spark.sql.KeyValueGroupedDataset@70c30f72
> >
> >   scala> mapGroups: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int,
> > _2: int]
> >
> >   scala> broadcasted: org.apache.spark.broadcast.Broadcast[Int] =
> > Broadcast(0)
> >
> >   scala>
> >   scala>
> >   scala> dataset: org.apache.spark.sql.Dataset[Int] = [value: int]
> >
> >   scala> org.apache.spark.SparkException: Job aborted due to stage
> failure:
> > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in
> > stage 0.0 (TID 0, localhost):
> > com.google.common.util.concurrent.ExecutionError:
> > java.lang.ClassCircularityError:
> > io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/
> MessageMatcher
> >   at com.google.common.cache.LocalCache$Segment.get(
> LocalCache.java:2261)
> >   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
> >   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> >   at
> > com.google.common.cache.LocalCache$LocalLoadingCache.
> get(LocalCache.java:4874)
> >   at
> > org.apache.spark.sql.catalyst.expressions.codegen.
> CodeGenerator$.compile(CodeGenerator.scala:841)
> >   at
> > org.apache.spark.sql.catalyst.expressions.codegen.
> GenerateSafeProjection$.create(GenerateSafeProjection.scala:188)
> >   at
> > org.apache.spark.sql.catalyst.expressions.codegen.
> GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
> >   at
> > org.apache.spark.sql.catalyst.expressions.codegen.
> CodeGenerator.generate(CodeGenerator.scala:825)
> >   at
> > org.apache.spark.sql.catalyst.expressions.codegen.
> CodeGenerator.generate(CodeGenerator.scala:822)
> >   at
> > org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(
> objects.scala:137)
> >   at
> > org.apache.spark.sql.execution.AppendColumnsExec$$
> anonfun$9.apply(objects.scala:251)
> >   at
> > org.apache.spark.sql.execution.AppendColumnsExec$$
> anonfun$9.apply(objects.scala:250)
> >   at
> > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$24.apply(RDD.scala:803)
> >   at
> > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$24.apply(RDD.scala:803)
> >   at
> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> >   at
> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> >   at
> > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:79)
> >   at
> > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:47)
> >   at org.apache.spark.scheduler.Task.run(Task.scala:86)
> >   at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:274)
> >   at
> > 

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Dongjoon Hyun
Hi, Sean.

The same failure blocks me, too.

- SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** FAILED ***

I used `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Dsparkr` 
on CentOS 7 / OpenJDK1.8.0_111.

Dongjoon.

On 2016-11-02 10:44 (-0700), Sean Owen  wrote: 
> Sigs, license, etc are OK. There are no Blockers for 2.0.2, though here are
> the 4 issues still open:
> 
> SPARK-14387 Enable Hive-1.x ORC compatibility with
> spark.sql.hive.convertMetastoreOrc
> SPARK-17957 Calling outer join and na.fill(0) and then inner join will miss
> rows
> SPARK-17981 Incorrectly Set Nullability to False in FilterExec
> SPARK-18160 spark.files & spark.jars should not be passed to driver in yarn
> mode
> 
> Running with Java 8, -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 on
> Ubuntu 16, I am seeing consistent failures in this test below. I think we
> very recently changed this so it could be legitimate. But does anyone else
> see something like this? I have seen other failures in this test due to OOM
> but my MAVEN_OPTS allows 6g of heap, which ought to be plenty.
> 
> 
> - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** FAILED
> ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
> /_/
> 
>   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
> 
>   scala>
>   scala> keyValueGrouped:
> org.apache.spark.sql.KeyValueGroupedDataset[Int,(Int, Int)] =
> org.apache.spark.sql.KeyValueGroupedDataset@70c30f72
> 
>   scala> mapGroups: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int,
> _2: int]
> 
>   scala> broadcasted: org.apache.spark.broadcast.Broadcast[Int] =
> Broadcast(0)
> 
>   scala>
>   scala>
>   scala> dataset: org.apache.spark.sql.Dataset[Int] = [value: int]
> 
>   scala> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in
> stage 0.0 (TID 0, localhost):
> com.google.common.util.concurrent.ExecutionError:
> java.lang.ClassCircularityError:
> io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:841)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:188)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:825)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:822)
>   at
> org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:137)
>   at
> org.apache.spark.sql.execution.AppendColumnsExec$$anonfun$9.apply(objects.scala:251)
>   at
> org.apache.spark.sql.execution.AppendColumnsExec$$anonfun$9.apply(objects.scala:250)
>   at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>   Caused by: java.lang.ClassCircularityError:
> io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at
> 

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Sean Owen
Sigs, license, etc are OK. There are no Blockers for 2.0.2, though here are
the 4 issues still open:

SPARK-14387 Enable Hive-1.x ORC compatibility with
spark.sql.hive.convertMetastoreOrc
SPARK-17957 Calling outer join and na.fill(0) and then inner join will miss
rows
SPARK-17981 Incorrectly Set Nullability to False in FilterExec
SPARK-18160 spark.files & spark.jars should not be passed to driver in yarn
mode

Running with Java 8, -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 on
Ubuntu 16, I am seeing consistent failures in this test below. I think we
very recently changed this so it could be legitimate. But does anyone else
see something like this? I have seen other failures in this test due to OOM
but my MAVEN_OPTS allows 6g of heap, which ought to be plenty.


- SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** FAILED
***
  isContain was true Interpreter output contained 'Exception':
  Welcome to
  __
   / __/__  ___ _/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
/_/

  Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala>
  scala> keyValueGrouped:
org.apache.spark.sql.KeyValueGroupedDataset[Int,(Int, Int)] =
org.apache.spark.sql.KeyValueGroupedDataset@70c30f72

  scala> mapGroups: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int,
_2: int]

  scala> broadcasted: org.apache.spark.broadcast.Broadcast[Int] =
Broadcast(0)

  scala>
  scala>
  scala> dataset: org.apache.spark.sql.Dataset[Int] = [value: int]

  scala> org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in
stage 0.0 (TID 0, localhost):
com.google.common.util.concurrent.ExecutionError:
java.lang.ClassCircularityError:
io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:841)
  at
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:188)
  at
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
  at
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:825)
  at
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:822)
  at
org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:137)
  at
org.apache.spark.sql.execution.AppendColumnsExec$$anonfun$9.apply(objects.scala:251)
  at
org.apache.spark.sql.execution.AppendColumnsExec$$anonfun$9.apply(objects.scala:250)
  at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
  at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
  Caused by: java.lang.ClassCircularityError:
io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at
io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
  at
io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
  at
io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
  at
io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
  at
io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:60)
  at

Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Russell Spitzer
I had one bounce message last week, but haven't seen anything else, I also
do the skip inbox filter thing though.

On Wed, Nov 2, 2016 at 10:16 AM Matei Zaharia 
wrote:

> It might be useful to ask Apache Infra whether they have any information
> on these (e.g. what do their own spam metrics say, do they get any feedback
> from Google, etc). Unfortunately mailing lists seem to be less and less
> well supported by most email providers.
>
> Matei
>
> On Nov 2, 2016, at 6:48 AM, Pete Robbins  wrote:
>
> I have gmail filters to add labels and skip inbox for anything sent to
> dev@spark user@spark etc but still get the occasional message marked as
> spam
>
>
> On Wed, 2 Nov 2016 at 08:18 Sean Owen  wrote:
>
> I couldn't figure out why I was missing a lot of dev@ announcements, and
> have just realized hundreds of messages to dev@ over the past month or so
> have been marked as spam for me by Gmail. I have no idea why but it's
> usually messages from Michael and Reynold, but not all of them. I'll see
> replies to the messages but not the original. Who knows. I can make a
> filter. I just wanted to give a heads up in case anyone else has been
> silently missing a lot of messages.
>
>
>


Re: Handling questions in the mailing lists

2016-11-02 Thread Reynold Xin
Actually, after talking with more ASF members, I believe the only policy is
that development decisions have to be made and announced on ASF properties
(dev list or JIRA), but user questions don't have to be.

I'm going to double-check this. If it is true, I would actually recommend
moving the Q&A part of the user list entirely over to Stack Overflow, or
at least making that the recommended way rather than the existing user list,
which is not very scalable.

On Wednesday, November 2, 2016, Nicholas Chammas 
wrote:

> We’ve discussed several times upgrading our communication tools, as far
> back as 2014 and maybe even before that too. The bottom line is that we
> can’t due to ASF rules requiring the use of ASF-managed mailing lists.
>
> For some history, see this discussion:
>
>    - https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oy5no2dhwj_kveop...@mail.gmail.com%3E
>    - https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=tktxy_...@mail.gmail.com%3E
>
> (It’s ironic that it’s difficult to follow the past discussion on why we
> can’t change our official communication tools due to those very tools…)
>
> Nick
> ​
>
> On Wed, Nov 2, 2016 at 12:24 PM Ricardo Almeida <
> ricardo.alme...@actnowib.com
> > wrote:
>
>> I feel Assaf's point is quite relevant if we want to move this project
>> forward from the Spark user perspective (as I do). In fact, we're still
>> using 20th-century tools (mailing lists) with some add-ons (like Stack
>> Overflow).
>>
>> As usual, Sean and Cody's contributions are very much to the point.
>> I feel it is indeed a matter of culture (hard to enforce) and tools
>> (much easier). Isn't it?
>>
>> On 2 November 2016 at 16:36, Cody Koeninger > > wrote:
>>
>>> So concrete things people could do
>>>
>>> - users could tag subject lines appropriately to the component they're
>>> asking about
>>>
>>> - contributors could monitor user@ for tags relating to components
>>> they've worked on.
>>> I'd be surprised if my miss rate for any mailing list questions
>>> well-labeled as Kafka was higher than 5%
>>>
>>> - committers could be more aggressive about soliciting and merging PRs
>>> to improve documentation.
>>> It's a lot easier to answer even poorly-asked questions with a link to
>>> relevant docs.
>>>
>>> On Wed, Nov 2, 2016 at 7:39 AM, Sean Owen >> > wrote:
>>> > There's already reviews@ and issues@. dev@ is for project development
>>> itself
>>> > and I think is OK. You're suggesting splitting up user@ and I
>>> sympathize
>>> > with the motivation. Experience tells me that we'll have a beginner@
>>> that's
>>> > then totally ignored, and people will quickly learn to post to
>>> advanced@ to
>>> > get attention, and we'll be back where we started. Putting it in JIRA
>>> > doesn't help. I don't think this a problem that is merely down to lack
>>> of
>>> > process. It actually requires cultivating a culture change on the
>>> community
>>> > list.
>>> >
>>> > On Wed, Nov 2, 2016 at 12:11 PM Mendelson, Assaf <
>>> assaf.mendel...@rsa.com
>>> >
>>> > wrote:
>>> >>
>>> >> What I am suggesting is basically to fix that.
>>> >>
>>> >> For example, we might say that mailing list A is only for voting,
>>> mailing
>>> >> list B is only for PR and have something like stack overflow for
>>> developer
>>> >> questions (I would even go as far as to have beginner, intermediate
>>> and
>>> >> advanced mailing list for users and beginner/advanced for dev).
>>> >>
>>> >>
>>> >>
>>> >> This can easily be done using stack overflow tags, however, that would
>>> >> probably be harder to manage.
>>> >>
>>> >> Maybe using special jira tags and manage it in jira?
>>> >>
>>> >>
>>> >>
>>> >> Anyway as I said, the main issue is not user questions (except maybe
>>> >> advanced ones) but more for dev questions. It is so easy to get lost
>>> in the
>>> >> chatter that it makes it very hard for people to learn spark
>>> internals…
>>> >>
>>> >> Assaf.
>>> >>
>>> >>
>>> >>
>>> >> From: Sean Owen [mailto:so...@cloudera.com
>>> ]
>>> >> Sent: Wednesday, November 02, 2016 2:07 PM
>>> >> To: Mendelson, Assaf; dev@spark.apache.org
>>> 
>>> >> Subject: Re: Handling questions in the mailing lists
>>> >>
>>> >>
>>> >>
>>> >> I think that 

Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Matei Zaharia
It might be useful to ask Apache Infra whether they have any information on 
these (e.g. what do their own spam metrics say, do they get any feedback from 
Google, etc). Unfortunately mailing lists seem to be less and less well 
supported by most email providers.

Matei

> On Nov 2, 2016, at 6:48 AM, Pete Robbins  wrote:
> 
> I have gmail filters to add labels and skip inbox for anything sent to 
> dev@spark user@spark etc but still get the occasional message marked as spam
> 
> 
> On Wed, 2 Nov 2016 at 08:18 Sean Owen  > wrote:
> I couldn't figure out why I was missing a lot of dev@ announcements, and have 
> just realized hundreds of messages to dev@ over the past month or so have 
> been marked as spam for me by Gmail. I have no idea why but it's usually 
> messages from Michael and Reynold, but not all of them. I'll see replies to 
> the messages but not the original. Who knows. I can make a filter. I just 
> wanted to give a heads up in case anyone else has been silently missing a lot 
> of messages.



Re: Handling questions in the mailing lists

2016-11-02 Thread Nicholas Chammas
We’ve discussed several times upgrading our communication tools, as far
back as 2014 and maybe even before that too. The bottom line is that we
can’t due to ASF rules requiring the use of ASF-managed mailing lists.

For some history, see this discussion:

   - https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oy5no2dhwj_kveop...@mail.gmail.com%3E
   - https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=tktxy_...@mail.gmail.com%3E

(It’s ironic that it’s difficult to follow the past discussion on why we
can’t change our official communication tools due to those very tools…)

Nick
​

On Wed, Nov 2, 2016 at 12:24 PM Ricardo Almeida <
ricardo.alme...@actnowib.com> wrote:

> I feel Assaf's point is quite relevant if we want to move this project
> forward from the Spark user perspective (as I do). In fact, we're still
> using 20th-century tools (mailing lists) with some add-ons (like Stack
> Overflow).
>
> As usual, Sean and Cody's contributions are very much to the point.
> I feel it is indeed a matter of culture (hard to enforce) and tools
> (much easier). Isn't it?
>
> On 2 November 2016 at 16:36, Cody Koeninger  wrote:
>
> So concrete things people could do
>
> - users could tag subject lines appropriately to the component they're
> asking about
>
> - contributors could monitor user@ for tags relating to components
> they've worked on.
> I'd be surprised if my miss rate for any mailing list questions
> well-labeled as Kafka was higher than 5%
>
> - committers could be more aggressive about soliciting and merging PRs
> to improve documentation.
> It's a lot easier to answer even poorly-asked questions with a link to
> relevant docs.
>
> On Wed, Nov 2, 2016 at 7:39 AM, Sean Owen  wrote:
> > There's already reviews@ and issues@. dev@ is for project development
> itself
> > and I think is OK. You're suggesting splitting up user@ and I sympathize
> > with the motivation. Experience tells me that we'll have a beginner@
> that's
> > then totally ignored, and people will quickly learn to post to advanced@
> to
> > get attention, and we'll be back where we started. Putting it in JIRA
> > doesn't help. I don't think this a problem that is merely down to lack of
> > process. It actually requires cultivating a culture change on the
> community
> > list.
> >
> > On Wed, Nov 2, 2016 at 12:11 PM Mendelson, Assaf <
> assaf.mendel...@rsa.com>
> > wrote:
> >>
> >> What I am suggesting is basically to fix that.
> >>
> >> For example, we might say that mailing list A is only for voting,
> mailing
> >> list B is only for PR and have something like stack overflow for
> developer
> >> questions (I would even go as far as to have beginner, intermediate and
> >> advanced mailing list for users and beginner/advanced for dev).
> >>
> >>
> >>
> >> This can easily be done using stack overflow tags, however, that would
> >> probably be harder to manage.
> >>
> >> Maybe using special jira tags and manage it in jira?
> >>
> >>
> >>
> >> Anyway as I said, the main issue is not user questions (except maybe
> >> advanced ones) but more for dev questions. It is so easy to get lost in
> the
> >> chatter that it makes it very hard for people to learn spark internals…
> >>
> >> Assaf.
> >>
> >>
> >>
> >> From: Sean Owen [mailto:so...@cloudera.com]
> >> Sent: Wednesday, November 02, 2016 2:07 PM
> >> To: Mendelson, Assaf; dev@spark.apache.org
> >> Subject: Re: Handling questions in the mailing lists
> >>
> >>
> >>
> >> I think that unfortunately mailing lists don't scale well. This one has
> >> thousands of subscribers with different interests and levels of
> experience.
> >> For any given person, most messages will be irrelevant. I also find
> that a
> >> lot of questions on user@ are not well-asked, aren't an SSCCE
> >> (http://sscce.org/), not something most people are going to bother
> replying
> >> to even if they could answer. I almost entirely ignore user@ because
> there
> >> are higher-priority channels like PRs to deal with, that already have
> >> hundreds of messages per day. This is why little of it gets an answer
> -- too
> >> noisy.
> >>
> >>
> >>
> >> We have to have official mailing lists, in any event, to have some
> >> official channel for things like votes and announcements. It's not
> wrong to
> >> ask questions on user@ of course, but a lot of the questions I see
> could
> >> have been answered with research of existing docs or looking at the
> code. I
> >> think that given the scale of the list, it's not wrong to assert that
> this
> >> is sort of a prerequisite for asking thousands of people to answer one's
> >> question. But we can't enforce that.
> >>
> >>
> >>
> >> The situation will get better to the extent people ask better questions,
> >> help other people ask better questions, and answer good questions. I'd
> >> encourage anyone feeling this way to try to help along 

Re: Handling questions in the mailing lists

2016-11-02 Thread Ricardo Almeida
I feel Assaf's point is quite relevant if we want to move this project
forward from the Spark user perspective (as I do). In fact, we're still
using 20th-century tools (mailing lists) with some add-ons (like Stack
Overflow).

As usual, Sean and Cody's contributions are very much to the point.
I feel it is indeed a matter of culture (hard to enforce) and tools
(much easier). Isn't it?

On 2 November 2016 at 16:36, Cody Koeninger  wrote:

> So concrete things people could do
>
> - users could tag subject lines appropriately to the component they're
> asking about
>
> - contributors could monitor user@ for tags relating to components
> they've worked on.
> I'd be surprised if my miss rate for any mailing list questions
> well-labeled as Kafka was higher than 5%
>
> - committers could be more aggressive about soliciting and merging PRs
> to improve documentation.
> It's a lot easier to answer even poorly-asked questions with a link to
> relevant docs.
>
> On Wed, Nov 2, 2016 at 7:39 AM, Sean Owen  wrote:
> > There's already reviews@ and issues@. dev@ is for project development
> itself
> > and I think is OK. You're suggesting splitting up user@ and I sympathize
> > with the motivation. Experience tells me that we'll have a beginner@
> that's
> > then totally ignored, and people will quickly learn to post to advanced@
> to
> > get attention, and we'll be back where we started. Putting it in JIRA
> > doesn't help. I don't think this a problem that is merely down to lack of
> > process. It actually requires cultivating a culture change on the
> community
> > list.
> >
> > On Wed, Nov 2, 2016 at 12:11 PM Mendelson, Assaf <
> assaf.mendel...@rsa.com>
> > wrote:
> >>
> >> What I am suggesting is basically to fix that.
> >>
> >> For example, we might say that mailing list A is only for voting,
> mailing
> >> list B is only for PR and have something like stack overflow for
> developer
> >> questions (I would even go as far as to have beginner, intermediate and
> >> advanced mailing list for users and beginner/advanced for dev).
> >>
> >>
> >>
> >> This can easily be done using stack overflow tags, however, that would
> >> probably be harder to manage.
> >>
> >> Maybe using special jira tags and manage it in jira?
> >>
> >>
> >>
> >> Anyway as I said, the main issue is not user questions (except maybe
> >> advanced ones) but more for dev questions. It is so easy to get lost in
> the
> >> chatter that it makes it very hard for people to learn spark internals…
> >>
> >> Assaf.
> >>
> >>
> >>
> >> From: Sean Owen [mailto:so...@cloudera.com]
> >> Sent: Wednesday, November 02, 2016 2:07 PM
> >> To: Mendelson, Assaf; dev@spark.apache.org
> >> Subject: Re: Handling questions in the mailing lists
> >>
> >>
> >>
> >> I think that unfortunately mailing lists don't scale well. This one has
> >> thousands of subscribers with different interests and levels of
> experience.
> >> For any given person, most messages will be irrelevant. I also find
> that a
> >> lot of questions on user@ are not well-asked, aren't an SSCCE
> >> (http://sscce.org/), not something most people are going to bother
> replying
> >> to even if they could answer. I almost entirely ignore user@ because
> there
> >> are higher-priority channels like PRs to deal with, that already have
> >> hundreds of messages per day. This is why little of it gets an answer
> -- too
> >> noisy.
> >>
> >>
> >>
> >> We have to have official mailing lists, in any event, to have some
> >> official channel for things like votes and announcements. It's not
> wrong to
> >> ask questions on user@ of course, but a lot of the questions I see
> could
> >> have been answered with research of existing docs or looking at the
> code. I
> >> think that given the scale of the list, it's not wrong to assert that
> this
> >> is sort of a prerequisite for asking thousands of people to answer one's
> >> question. But we can't enforce that.
> >>
> >>
> >>
> >> The situation will get better to the extent people ask better questions,
> >> help other people ask better questions, and answer good questions. I'd
> >> encourage anyone feeling this way to try to help along those dimensions.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Nov 2, 2016 at 11:32 AM assaf.mendelson <
> assaf.mendel...@rsa.com>
> >> wrote:
> >>
> >> Hi,
> >>
> >> I know this is a little off topic but I wanted to raise an issue about
> >> handling questions in the mailing list (this is true both for the user
> >> mailing list and the dev but since there are other options such as stack
> >> overflow for user questions, this is more problematic in dev).
> >>
> >> Let’s say I ask a question (as I recently did). Unfortunately this was
> >> during spark summit in Europe so probably people were busy. In any case
> no
> >> one answered.
> >>
> >> The problem is, that if no one answers very soon, the question will
> almost
> >> certainly remain unanswered because new messages will 

Re: Handling questions in the mailing lists

2016-11-02 Thread Cody Koeninger
So, concrete things people could do:

- users could tag subject lines appropriately to the component they're
asking about

- contributors could monitor user@ for tags relating to components
they've worked on.
I'd be surprised if my miss rate for any mailing list questions
well-labeled as Kafka was higher than 5%

- committers could be more aggressive about soliciting and merging PRs
to improve documentation.
It's a lot easier to answer even poorly-asked questions with a link to
relevant docs.

On Wed, Nov 2, 2016 at 7:39 AM, Sean Owen  wrote:
> There's already reviews@ and issues@. dev@ is for project development itself
> and I think is OK. You're suggesting splitting up user@ and I sympathize
> with the motivation. Experience tells me that we'll have a beginner@ that's
> then totally ignored, and people will quickly learn to post to advanced@ to
> get attention, and we'll be back where we started. Putting it in JIRA
> doesn't help. I don't think this is a problem that is merely down to lack of
> process. It actually requires cultivating a culture change on the community
> list.
>
> On Wed, Nov 2, 2016 at 12:11 PM Mendelson, Assaf 
> wrote:
>>
>> What I am suggesting is basically to fix that.
>>
>> For example, we might say that mailing list A is only for voting, mailing
>> list B is only for PR and have something like stack overflow for developer
>> questions (I would even go as far as to have beginner, intermediate and
>> advanced mailing list for users and beginner/advanced for dev).
>>
>>
>>
>> This can easily be done using stack overflow tags, however, that would
>> probably be harder to manage.
>>
>> Maybe using special jira tags and manage it in jira?
>>
>>
>>
>> Anyway as I said, the main issue is not user questions (except maybe
>> advanced ones) but more for dev questions. It is so easy to get lost in the
>> chatter that it makes it very hard for people to learn spark internals…
>>
>> Assaf.
>>
>>
>>
>> From: Sean Owen [mailto:so...@cloudera.com]
>> Sent: Wednesday, November 02, 2016 2:07 PM
>> To: Mendelson, Assaf; dev@spark.apache.org
>> Subject: Re: Handling questions in the mailing lists
>>
>>
>>
>> I think that unfortunately mailing lists don't scale well. This one has
>> thousands of subscribers with different interests and levels of experience.
>> For any given person, most messages will be irrelevant. I also find that a
>> lot of questions on user@ are not well-asked, aren't an SSCCE
>> (http://sscce.org/), not something most people are going to bother replying
>> to even if they could answer. I almost entirely ignore user@ because there
>> are higher-priority channels like PRs to deal with, that already have
>> hundreds of messages per day. This is why little of it gets an answer -- too
>> noisy.
>>
>>
>>
>> We have to have official mailing lists, in any event, to have some
>> official channel for things like votes and announcements. It's not wrong to
>> ask questions on user@ of course, but a lot of the questions I see could
>> have been answered with research of existing docs or looking at the code. I
>> think that given the scale of the list, it's not wrong to assert that this
>> is sort of a prerequisite for asking thousands of people to answer one's
>> question. But we can't enforce that.
>>
>>
>>
>> The situation will get better to the extent people ask better questions,
>> help other people ask better questions, and answer good questions. I'd
>> encourage anyone feeling this way to try to help along those dimensions.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Nov 2, 2016 at 11:32 AM assaf.mendelson 
>> wrote:
>>
>> Hi,
>>
>> I know this is a little off topic but I wanted to raise an issue about
>> handling questions in the mailing list (this is true both for the user
>> mailing list and the dev but since there are other options such as stack
>> overflow for user questions, this is more problematic in dev).
>>
>> Let’s say I ask a question (as I recently did). Unfortunately this was
>> during spark summit in Europe so probably people were busy. In any case no
>> one answered.
>>
>> The problem is, that if no one answers very soon, the question will almost
>> certainly remain unanswered because new messages will simply drown it.
>>
>>
>>
>> This is a common issue not just for questions but for any comment or idea
>> which is not immediately picked up.
>>
>>
>>
>> I believe we should have a method of handling this.
>>
>> Generally, I would say these types of things belong in stack overflow,
>> after all, the way it is built is perfect for this. More seasoned spark
>> contributors and committers can periodically check out unanswered questions
>> and answer them.
>>
>> The problem is that stack overflow (as well as other targets such as the
>> databricks forums) tend to have a more user based orientation. This means
>> that any spark internal question will almost certainly remain unanswered.
>>
>>
>>
> >> I was wondering if we could come up with a solution for this.

Re: Updating Parquet dep to 1.9

2016-11-02 Thread Michael Allman
Sounds great. Regarding the min/max stats issue, is that an issue with the way 
the files are written or read? What's the Parquet project issue for that bug? 
What does the 1.9.1 release timeline look like?

I will aim to have a PR in by the end of the week. I feel strongly that either
this or https://github.com/apache/spark/pull/15538 needs to make it into 2.1.
The logging output issue is really bad. I would probably call it a blocker.

Michael
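
To make the sort-order question concrete for anyone following this thread: UTF-8
lexicographic order corresponds to comparing bytes as unsigned values, while a plain
signed (Java byte) comparison treats any byte with the high bit set as negative and
sorts it first. The snippet below is only an editorial illustration of that mismatch,
not Parquet's actual comparator; the object name, helper, and example strings are
made up for the demo.

// Illustration only: why signed byte comparison produces misleading min/max
// values for UTF-8 strings that contain bytes with the msb set.
object SignedVsUnsignedDemo {

  // Lexicographic byte comparison, signed or unsigned.
  def compareBytes(a: Array[Byte], b: Array[Byte], signed: Boolean): Int = {
    val len = math.min(a.length, b.length)
    var i = 0
    while (i < len) {
      val cmp =
        if (signed) java.lang.Byte.compare(a(i), b(i))
        else Integer.compare(a(i) & 0xFF, b(i) & 0xFF)
      if (cmp != 0) return cmp
      i += 1
    }
    Integer.compare(a.length, b.length)
  }

  def main(args: Array[String]): Unit = {
    val ascii    = "abc".getBytes("UTF-8")
    val accented = "é".getBytes("UTF-8") // 0xC3 0xA9: both bytes have the msb set

    // Correct UTF-8 (unsigned) order: "abc" < "é" => negative result
    println(compareBytes(ascii, accented, signed = false))
    // Signed comparison flips the order => positive result, so a "max" computed this
    // way can be smaller than values actually present in the file
    println(compareBytes(ascii, accented, signed = true))
  }
}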


> On Nov 1, 2016, at 1:22 PM, Ryan Blue  wrote:
> 
> I can when I'm finished with a couple other issues if no one gets to it first.
> 
> Michael, if you're interested in updating to 1.9.0 I'm happy to help review 
> that PR.
> 
> On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin wrote:
> Ryan want to submit a pull request?
> 
> 
> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue wrote:
> 1.9.0 includes some fixes intended specifically for Spark:
> 
> * PARQUET-389: Evaluates push-down predicates for missing columns as though 
> they are null. This is to address Spark's work-around that requires reading 
> and merging file schemas, even for metastore tables.
> * PARQUET-654: Adds an option to disable record-level predicate push-down, 
> but keep row group evaluation. This allows Spark to skip row groups based on 
> stats and dictionaries, but implement its own vectorized record filtering.
> 
> The Parquet community also evaluated performance to ensure no performance 
> regressions from moving to the ByteBuffer read path.
> 
> There is one concern about 1.9.0 that will be addressed in 1.9.1, which is 
> that stats calculations were incorrectly using signed byte order for string 
> comparison. This means that min/max stats can't be used if the data contains 
> (or may contain) UTF8 characters with the msb set. 1.9.0 won't return the bad 
> min/max values for correctness, but there is a property to override this 
> behavior for data that doesn't use the affected code points.
> 
> Upgrading to 1.9.0 depends on how the community wants to handle the sort 
> order bug: whether correctness or performance should be the default.
> 
> rb
> 
> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen wrote:
> Yes this came up from a different direction: 
> https://issues.apache.org/jira/browse/SPARK-18140 
> 
> 
> I think it's fine to pursue an upgrade to fix these several issues. The 
> question is just how well it will play with other components, so bears some 
> testing and evaluation of the changes from 1.8, but yes this would be good.
> 
> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman wrote:
> Hi All,
> 
> Is anyone working on updating Spark's Parquet library dep to 1.9? If not, I 
> can at least get started on it and publish a PR.
> 
> Cheers,
> 
> Michael
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix



Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Pete Robbins
I have Gmail filters to add labels and skip the inbox for anything sent to
dev@spark, user@spark, etc., but I still get the occasional message marked as spam.


On Wed, 2 Nov 2016 at 08:18 Sean Owen  wrote:

> I couldn't figure out why I was missing a lot of dev@ announcements, and
> have just realized hundreds of messages to dev@ over the past month or so
> have been marked as spam for me by Gmail. I have no idea why but it's
> usually messages from Michael and Reynold, but not all of them. I'll see
> replies to the messages but not the original. Who knows. I can make a
> filter. I just wanted to give a heads up in case anyone else has been
> silently missing a lot of messages.
>


Re: Handling questions in the mailing lists

2016-11-02 Thread Sean Owen
There's already reviews@ and issues@. dev@ is for project development
itself and I think is OK. You're suggesting splitting up user@ and I
sympathize with the motivation. Experience tells me that we'll have a
beginner@ that's then totally ignored, and people will quickly learn to
post to advanced@ to get attention, and we'll be back where we started.
Putting it in JIRA doesn't help. I don't think this is a problem that is
merely down to lack of process. It actually requires cultivating a culture
change on the community list.

On Wed, Nov 2, 2016 at 12:11 PM Mendelson, Assaf 
wrote:

> What I am suggesting is basically to fix that.
>
> For example, we might say that mailing list A is only for voting, mailing
> list B is only for PR and have something like stack overflow for developer
> questions (I would even go as far as to have beginner, intermediate and
> advanced mailing list for users and beginner/advanced for dev).
>
>
>
> This can easily be done using stack overflow tags, however, that would
> probably be harder to manage.
>
> Maybe using special jira tags and manage it in jira?
>
>
>
> Anyway as I said, the main issue is not user questions (except maybe
> advanced ones) but more for dev questions. It is so easy to get lost in the
> chatter that it makes it very hard for people to learn spark internals…
>
> Assaf.
>
>
>
> *From:* Sean Owen [mailto:so...@cloudera.com]
> *Sent:* Wednesday, November 02, 2016 2:07 PM
> *To:* Mendelson, Assaf; dev@spark.apache.org
> *Subject:* Re: Handling questions in the mailing lists
>
>
>
> I think that unfortunately mailing lists don't scale well. This one has
> thousands of subscribers with different interests and levels of experience.
> For any given person, most messages will be irrelevant. I also find that a
> lot of questions on user@ are not well-asked, aren't an SSCCE (
> http://sscce.org/), not something most people are going to bother
> replying to even if they could answer. I almost entirely ignore user@
> because there are higher-priority channels like PRs to deal with, that
> already have hundreds of messages per day. This is why little of it gets an
> answer -- too noisy.
>
>
>
> We have to have official mailing lists, in any event, to have some
> official channel for things like votes and announcements. It's not wrong to
> ask questions on user@ of course, but a lot of the questions I see could
> have been answered with research of existing docs or looking at the code. I
> think that given the scale of the list, it's not wrong to assert that this
> is sort of a prerequisite for asking thousands of people to answer one's
> question. But we can't enforce that.
>
>
>
> The situation will get better to the extent people ask better questions,
> help other people ask better questions, and answer good questions. I'd
> encourage anyone feeling this way to try to help along those dimensions.
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Nov 2, 2016 at 11:32 AM assaf.mendelson 
> wrote:
>
> Hi,
>
> I know this is a little off topic but I wanted to raise an issue about
> handling questions in the mailing list (this is true both for the user
> mailing list and the dev but since there are other options such as stack
> overflow for user questions, this is more problematic in dev).
>
> Let’s say I ask a question (as I recently did). Unfortunately this was
> during spark summit in Europe so probably people were busy. In any case no
> one answered.
>
> The problem is, that if no one answers very soon, the question will almost
> certainly remain unanswered because new messages will simply drown it.
>
>
>
> This is a common issue not just for questions but for any comment or idea
> which is not immediately picked up.
>
>
>
> I believe we should have a method of handling this.
>
> Generally, I would say these types of things belong in stack overflow,
> after all, the way it is built is perfect for this. More seasoned spark
> contributors and committers can periodically check out unanswered questions
> and answer them.
>
> The problem is that stack overflow (as well as other targets such as the
> databricks forums) tend to have a more user based orientation. This means
> that any spark internal question will almost certainly remain unanswered.
>
>
>
> I was wondering if we could come up with a solution for this.
>
>
>
> Assaf.
>
>
>
>
> --
>
> View this message in context: Handling questions in the mailing lists
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
>


RE: Handling questions in the mailing lists

2016-11-02 Thread Mendelson, Assaf
What I am suggesting is basically to fix that.
For example, we might say that mailing list A is only for voting, mailing list 
B is only for PR and have something like stack overflow for developer questions 
(I would even go as far as to have beginner, intermediate and advanced mailing 
list for users and beginner/advanced for dev).

This can easily be done using stack overflow tags, however, that would probably 
be harder to manage.
Maybe using special jira tags and manage it in jira?

Anyway as I said, the main issue is not user questions (except maybe advanced 
ones) but more for dev questions. It is so easy to get lost in the chatter that 
it makes it very hard for people to learn spark internals…
Assaf.

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, November 02, 2016 2:07 PM
To: Mendelson, Assaf; dev@spark.apache.org
Subject: Re: Handling questions in the mailing lists

I think that unfortunately mailing lists don't scale well. This one has 
thousands of subscribers with different interests and levels of experience. For 
any given person, most messages will be irrelevant. I also find that a lot of 
questions on user@ are not well-asked, aren't an SSCCE (http://sscce.org/), not 
something most people are going to bother replying to even if they could 
answer. I almost entirely ignore user@ because there are higher-priority 
channels like PRs to deal with, that already have hundreds of messages per day. 
This is why little of it gets an answer -- too noisy.

We have to have official mailing lists, in any event, to have some official 
channel for things like votes and announcements. It's not wrong to ask 
questions on user@ of course, but a lot of the questions I see could have been 
answered with research of existing docs or looking at the code. I think that 
given the scale of the list, it's not wrong to assert that this is sort of a 
prerequisite for asking thousands of people to answer one's question. But we 
can't enforce that.

The situation will get better to the extent people ask better questions, help 
other people ask better questions, and answer good questions. I'd encourage 
anyone feeling this way to try to help along those dimensions.





On Wed, Nov 2, 2016 at 11:32 AM assaf.mendelson wrote:
Hi,
I know this is a little off topic but I wanted to raise an issue about handling 
questions in the mailing list (this is true both for the user mailing list and 
the dev but since there are other options such as stack overflow for user 
questions, this is more problematic in dev).
Let’s say I ask a question (as I recently did). Unfortunately this was during 
spark summit in Europe so probably people were busy. In any case no one 
answered.
The problem is, that if no one answers very soon, the question will almost 
certainly remain unanswered because new messages will simply drown it.

This is a common issue not just for questions but for any comment or idea which 
is not immediately picked up.

I believe we should have a method of handling this.
Generally, I would say these types of things belong in stack overflow, after 
all, the way it is built is perfect for this. More seasoned spark contributors 
and committers can periodically check out unanswered questions and answer them.
The problem is that stack overflow (as well as other targets such as the 
databricks forums) tend to have a more user based orientation. This means that 
any spark internal question will almost certainly remain unanswered.

I was wondering if we could come up with a solution for this.

Assaf.



View this message in context: Handling questions in the mailing lists
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Handling questions in the mailing lists

2016-11-02 Thread Sean Owen
I think that unfortunately mailing lists don't scale well. This one has
thousands of subscribers with different interests and levels of experience.
For any given person, most messages will be irrelevant. I also find that a
lot of questions on user@ are not well-asked, aren't an SSCCE (
http://sscce.org/), not something most people are going to bother replying
to even if they could answer. I almost entirely ignore user@ because there
are higher-priority channels like PRs to deal with, that already have
hundreds of messages per day. This is why little of it gets an answer --
too noisy.

We have to have official mailing lists, in any event, to have some official
channel for things like votes and announcements. It's not wrong to ask
questions on user@ of course, but a lot of the questions I see could have
been answered with research of existing docs or looking at the code. I
think that given the scale of the list, it's not wrong to assert that this
is sort of a prerequisite for asking thousands of people to answer one's
question. But we can't enforce that.

The situation will get better to the extent people ask better questions,
help other people ask better questions, and answer good questions. I'd
encourage anyone feeling this way to try to help along those dimensions.





On Wed, Nov 2, 2016 at 11:32 AM assaf.mendelson 
wrote:

> Hi,
>
> I know this is a little off topic but I wanted to raise an issue about
> handling questions in the mailing list (this is true both for the user
> mailing list and the dev but since there are other options such as stack
> overflow for user questions, this is more problematic in dev).
>
> Let’s say I ask a question (as I recently did). Unfortunately this was
> during spark summit in Europe so probably people were busy. In any case no
> one answered.
>
> The problem is, that if no one answers very soon, the question will almost
> certainly remain unanswered because new messages will simply drown it.
>
>
>
> This is a common issue not just for questions but for any comment or idea
> which is not immediately picked up.
>
>
>
> I believe we should have a method of handling this.
>
> Generally, I would say these types of things belong in stack overflow,
> after all, the way it is built is perfect for this. More seasoned spark
> contributors and committers can periodically check out unanswered questions
> and answer them.
>
> The problem is that stack overflow (as well as other targets such as the
> databricks forums) tend to have a more user based orientation. This means
> that any spark internal question will almost certainly remain unanswered.
>
>
>
> I was wondering if we could come up with a solution for this.
>
>
>
> Assaf.
>
>
>
> --
> View this message in context: Handling questions in the mailing lists
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>


Handling questions in the mailing lists

2016-11-02 Thread assaf.mendelson
Hi,
I know this is a little off topic but I wanted to raise an issue about handling 
questions in the mailing list (this is true both for the user mailing list and 
the dev but since there are other options such as stack overflow for user 
questions, this is more problematic in dev).
Let's say I ask a question (as I recently did). Unfortunately this was during 
spark summit in Europe so probably people were busy. In any case no one 
answered.
The problem is, that if no one answers very soon, the question will almost 
certainly remain unanswered because new messages will simply drown it.

This is a common issue not just for questions but for any comment or idea which 
is not immediately picked up.

I believe we should have a method of handling this.
Generally, I would say these types of things belong in stack overflow, after 
all, the way it is built is perfect for this. More seasoned spark contributors 
and committers can periodically check out unanswered questions and answer them.
The problem is that stack overflow (as well as other targets such as the 
databricks forums) tend to have a more user based orientation. This means that 
any spark internal question will almost certainly remain unanswered.

I was wondering if we could come up with a solution for this.

Assaf.





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-questions-in-the-mailing-lists-tp19690.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Sean Owen
I couldn't figure out why I was missing a lot of dev@ announcements, and
have just realized hundreds of messages to dev@ over the past month or so
have been marked as spam for me by Gmail. I have no idea why but it's
usually messages from Michael and Reynold, but not all of them. I'll see
replies to the messages but not the original. Who knows. I can make a
filter. I just wanted to give a heads up in case anyone else has been
silently missing a lot of messages.