Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-05 Thread Shivaram Venkataraman
Yeah, I see the Apache Maven repos have the 2.0.1 artifacts at
https://repository.apache.org/content/repositories/releases/org/apache/spark/spark-core_2.11/
-- not sure why they haven't synced to Maven Central yet
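In the meantime it's possible to pull from that repository directly. A
minimal build.sbt sketch (assuming a Scala 2.11 build, with Spark provided
by spark-submit at runtime):

// Sketch only: add the Apache releases repository as an extra resolver until
// the artifacts show up on Maven Central.
resolvers += "Apache Releases" at
  "https://repository.apache.org/content/repositories/releases/"

// Standard Spark coordinates; "provided" assumes spark-submit supplies Spark.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1" % "provided"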

Shivaram

On Wed, Oct 5, 2016 at 8:37 PM, Luciano Resende  wrote:
> It usually doesn't take that long to be synced, but I still don't see any
> 2.0.1-related artifacts on Maven Central
>
> http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22%20AND%20v%3A%222.0.1%22
>
>
> On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin  wrote:
>>
>> They have been published yesterday, but can take a while to propagate.
>>
>>
>> On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar 
>> wrote:
>>>
>>> Hi,
>>>
>>> It seems like the 2.0.1 artifact hasn't been published to Maven Central. Can
>>> anyone confirm?
>>>
>>> On Tue, Oct 4, 2016 at 5:39 PM, Reynold Xin  wrote:

 We are happy to announce the availability of Spark 2.0.1!

 Apache Spark 2.0.1 is a maintenance release containing 300 stability and
 bug fixes. This release is based on the branch-2.0 maintenance branch of
 Spark. We strongly recommend all 2.0.0 users to upgrade to this stable
 release.

 To download Apache Spark 2.0.1, visit
 http://spark.apache.org/downloads.html

 We would like to acknowledge all community members for contributing
 patches to this release.


>>>
>>>
>>>
>>> --
>>> --
>>> Cheers,
>>> Praj
>>
>>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-10-05 Thread Reynold Xin
I think this is fairly important to do, so I went ahead and created a PR for
the first mini step: https://github.com/apache/spark/pull/15374
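For context, here is a rough sketch of the kind of split being discussed.
The names below are only illustrative placeholders taken from the naming
debate further down, not necessarily what the PR adds:

package org.apache.spark.annotation

import scala.annotation.StaticAnnotation

// Audience: who the API is aimed at (end-user applications vs. libraries
// that build on top of Spark). Names are placeholders.
class ApplicationApi extends StaticAnnotation
class LibraryApi extends StaticAnnotation

// Stability: how strongly compatibility is promised, independent of audience.
class Stable extends StaticAnnotation
class Evolving extends StaticAnnotation
class Unstable extends StaticAnnotation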



On Wed, Aug 24, 2016 at 9:48 AM, Reynold Xin  wrote:

> Looks like in general people like it. The next step is for somebody to take
> the lead and implement it.
>
> Tom do you have cycles to do this?
>
>
> On Wednesday, August 24, 2016, Tom Graves  wrote:
>
>> ping, did this discussion conclude or did we decide what we are doing?
>>
>> Tom
>>
>>
>> On Friday, May 13, 2016 3:19 PM, Michael Armbrust 
>> wrote:
>>
>>
>> +1 to the general structure of Reynold's proposal.  I've found what we do
>> currently a little confusing.  In particular, it doesn't make much sense
>> that @DeveloperApi things are always labeled as possibly changing.  For
>> example the Data Source API should arguably be one of the most stable
>> interfaces, since it's very difficult for users to recompile libraries that
>> might break when there are changes.
>>
>> For a similar reason, I don't really see the point of LimitedPrivate.
>> The goal here should be communication of promises of stability or future
>> stability.
>>
>> Regarding Developer vs. Public. I don't care too much about the naming,
>> but it does seem useful to differentiate APIs that we expect end users to
>> consume from those that are used to augment Spark. "Library" and
>> "Application" also seem reasonable.
>>
>> On Fri, May 13, 2016 at 11:15 AM, Marcelo Vanzin 
>> wrote:
>>
>> On Fri, May 13, 2016 at 10:18 AM, Sean Busbey 
>> wrote:
>> > I think LimitedPrivate gets a bad rap due to the way it is misused in
>> > Hadoop. The use case here -- "we offer this to developers of
>> > intermediate layers; those willing to update their software as we
>> > update ours"
>>
>> I think "LimitedPrivate" is a rather confusing name for that. I think
>> Reynold's first e-mail better matches that use case: this would be
>> "InterfaceAudience(Developer)" and "InterfaceStability(Experimental)".
>>
>> But I don't really like "Developer" as a name here, because it's
>> ambiguous. Developer of what? Theoretically everybody writing Spark or
>> on top of its APIs is a developer. In that sense, I prefer using
>> something like "Library" and "Application" instead of "Developer" and
>> "Public".
>>
>> Personally, in fact, I don't see a lot of gain in differentiating
>> between the target users of an interface... knowing whether it's a
>> stable interface or not is a lot more useful. If you're equating a
>> "developer API" with "it's not really stable", then you don't really
>> need two annotations for that - just say it's not stable.
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>>
>>
>>


Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-05 Thread Luciano Resende
It usually doesn't take that long to be synced, but I still don't see any
2.0.1-related artifacts on Maven Central

http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22%20AND%20v%3A%222.0.1%22


On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin  wrote:

> They have been published yesterday, but can take a while to propagate.
>
>
> On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar 
> wrote:
>
>> Hi,
>>
>> It seems like the 2.0.1 artifact hasn't been published to Maven Central. Can
>> anyone confirm?
>>
>> On Tue, Oct 4, 2016 at 5:39 PM, Reynold Xin  wrote:
>>
>>> We are happy to announce the availability of Spark 2.0.1!
>>>
>>> Apache Spark 2.0.1 is a maintenance release containing 300 stability and
>>> bug fixes. This release is based on the branch-2.0 maintenance branch of
>>> Spark. We strongly recommend all 2.0.0 users to upgrade to this stable
>>> release.
>>>
>>> To download Apache Spark 2.0.1, visit http://spark.apache.org/downloads.html
>>>
>>> We would like to acknowledge all community members for contributing
>>> patches to this release.
>>>
>>>
>>>
>>
>>
>> --
>> --
>> Cheers,
>> Praj
>>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: welcoming Xiao Li as a committer

2016-10-05 Thread Liwei Lin
Congratulations, Xiao!

Cheers,
Liwei

On Thu, Oct 6, 2016 at 5:38 AM, DB Tsai  wrote:

> Congrats, Xiao!
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0x9DCC1DBD7FC7BBB2
>
>
> On Wed, Oct 5, 2016 at 2:36 PM, Fred Reiss  wrote:
> > Congratulations, Xiao!
> >
> > Fred
> >
> >
> > On Tuesday, October 4, 2016, Joseph Bradley 
> wrote:
> >>
> >> Congrats!
> >>
> >> On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta <
> saru...@oss.nttdata.co.jp>
> >> wrote:
> >>>
> >>> Congratulations Xiao!
> >>>
> >>> - Kousuke
> >>>
> >>> On 2016/10/05 7:44, Bryan Cutler wrote:
> >>>
> >>> Congrats Xiao!
> >>>
> >>> On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau 
> >>> wrote:
> 
>  Congratulations :D :) Yay!
> 
>  On Tue, Oct 4, 2016 at 11:14 AM, Suresh Thalamati
>   wrote:
> >
> > Congratulations, Xiao!
> >
> >
> >
> > > On Oct 3, 2016, at 10:46 PM, Reynold Xin 
> wrote:
> > >
> > > Hi all,
> > >
> > > Xiao Li, aka gatorsmile, has recently been elected as an Apache
> Spark
> > > committer. Xiao has been a super active contributor to Spark SQL.
> Congrats
> > > and welcome, Xiao!
> > >
> > > - Reynold
> > >
> >
> >
> > 
> -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> 
> 
> 
>  --
>  Cell : 425-233-8271
>  Twitter: https://twitter.com/holdenkarau
> >>>
> >>>
> >>>
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-05 Thread Fred Reiss
Thanks for the thoughtful comments, Michael and Shivaram. From what I’ve
seen in this thread and on JIRA, it looks like the current plan with regard
to application-facing APIs for sinks is roughly:
1. Rewrite incremental query compilation for Structured Streaming.
2. Redesign Structured Streaming's source and sink APIs so that they do not
depend on RDDs.
3. Allow the new APIs to stabilize.
4. Open these APIs to use by application code.

Is there a way for those of us who aren’t involved in the first two steps
to get some idea of the current plans and progress? I get asked a lot about
when Structured Streaming will be a viable replacement for Spark Streaming,
and I like to be able to give accurate advice.

Fred

On Tue, Oct 4, 2016 at 3:02 PM, Michael Armbrust 
wrote:

> I don't quite understand why exposing it indirectly through a typed
>> interface should be delayed before finalizing the API.
>>
>
> Spark has a long history
>  of maintaining
> binary compatibility in its public APIs.  I strongly believe this is one of
> the things that has made the project successful.  Exposing internals that
> we know are going to change in the primary user facing API for creating
> Streaming DataFrames seems directly counter to this goal.  I think the
> argument that "you can do it anyway" fails to capture the expectations of
> users who probably aren't closely following this discussion.
>
> If advanced users want to dig through the code and experiment, great.  I
> hope they report back on what's good and what can be improved.  However, if
> you add the function suggested in the PR to DataStreamReader, you are
> giving them a bad experience by leaking internals that don't even show up
> in the published documentation.
>
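(For anyone who does want to experiment today, the internal hooks look
roughly like the sketch below as of 2.0.x. These are internal classes that
may change without notice, which is exactly the point above, and the provider
class and package name here are made up.)

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Hypothetical provider, resolved by fully qualified class name; both Sink
// and StreamSinkProvider are internal APIs in 2.0.x and may change.
class ConsoleCountSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new Sink {
    // Each micro-batch arrives as a DataFrame; this sink just counts it.
    override def addBatch(batchId: Long, data: DataFrame): Unit =
      println(s"batch $batchId: ${data.count()} rows")
  }
}

// Usage (class name and package are hypothetical):
//   df.writeStream.format("my.pkg.ConsoleCountSinkProvider").start()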


Re: welcoming Xiao Li as a committer

2016-10-05 Thread DB Tsai
Congrats, Xiao!

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x9DCC1DBD7FC7BBB2


On Wed, Oct 5, 2016 at 2:36 PM, Fred Reiss  wrote:
> Congratulations, Xiao!
>
> Fred
>
>
> On Tuesday, October 4, 2016, Joseph Bradley  wrote:
>>
>> Congrats!
>>
>> On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta 
>> wrote:
>>>
>>> Congratulations Xiao!
>>>
>>> - Kousuke
>>>
>>> On 2016/10/05 7:44, Bryan Cutler wrote:
>>>
>>> Congrats Xiao!
>>>
>>> On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau 
>>> wrote:

 Congratulations :D :) Yay!

 On Tue, Oct 4, 2016 at 11:14 AM, Suresh Thalamati
  wrote:
>
> Congratulations, Xiao!
>
>
>
> > On Oct 3, 2016, at 10:46 PM, Reynold Xin  wrote:
> >
> > Hi all,
> >
> > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
> > committer. Xiao has been a super active contributor to Spark SQL. 
> > Congrats
> > and welcome, Xiao!
> >
> > - Reynold
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>



 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: welcoming Xiao Li as a committer

2016-10-05 Thread Fred Reiss
Congratulations, Xiao!

Fred

On Tuesday, October 4, 2016, Joseph Bradley  wrote:

> Congrats!
>
> On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta  > wrote:
>
>> Congratulations Xiao!
>>
>> - Kousuke
>> On 2016/10/05 7:44, Bryan Cutler wrote:
>>
>> Congrats Xiao!
>>
>> On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau > > wrote:
>>
>>> Congratulations :D :) Yay!
>>>
>>> On Tue, Oct 4, 2016 at 11:14 AM, Suresh Thalamati <
>>> suresh.thalam...@gmail.com
>>> > wrote:
>>>
 Congratulations, Xiao!



 > On Oct 3, 2016, at 10:46 PM, Reynold Xin >>> > wrote:
 >
 > Hi all,
 >
 > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
 committer. Xiao has been a super active contributor to Spark SQL. Congrats
 and welcome, Xiao!
 >
 > - Reynold
 >


 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 


>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>


PySpark UDF Performance Exploration w/Jython (Early/rough 2~3X improvement*) [SPARK-15369]

2016-10-05 Thread Holden Karau
Hi Python Spark Developers & Users,

As Datasets/DataFrames are becoming the core building block of Spark, and
as someone who cares about Python Spark performance, I've been looking more
at PySpark UDF performance.

I've got an early WIP/request-for-comments pull request open, with a
corresponding design document and JIRA (SPARK-15369), that allows for
selective UDF evaluation in Jython. Now that Spark 2.0.1 is out I'd really
love people's input or feedback on this proposal so I can circle back with a
more complete PR :) I'd love to hear from people using PySpark whether this
is something that looks interesting (as well as from the PySpark developers)
for some of the open questions :)

For users: If you have simple Python UDFs (or, even better, UDFs and
datasets) that you can share for benchmarking, it would be really useful to
be able to add them to the benchmarking I've been looking at in the design
doc. It would also be useful to know if some, many, or none of your UDFs
can be evaluated by Jython. If you have UDFs you aren't comfortable sharing
on-list, feel free to reach out to me directly.

Some general open questions:

1) The draft PR does some magic** to allow functions to be passed in at
least some of the time - is that something people are interested in, or
would it be better to leave the magic out and just require that a string
representing the lambda be passed in?

2) Would it be useful to provide easy steps to use JyNI (it's LGPL licensed,
so I don't think we can include it out of the box - but we could try to make
it easy for users to link with if it's important)?

3) While we have a 2x speedup for tokenization/wordcount (getting close to
native Scala perf) - what is performance like for other workloads (please
share your desired UDFs/workloads for my evil benchmarking plans)?

4) What does the eventual Dataset API look like for Python? (This could
partially influence #1.)

5) How important is it to not add the Jython dependencies to the build
weight for non-Python users (and, if so, which workaround to choose - maybe
something like spark-hive?)

6) Do you often chain PySpark UDF operations and is that something we
should try and optimize for in Jython as well?

7) How many of your Python UDFs can / can not be evaluated in Jython for
one reason or another?

8) Do your UDFs depend on Spark accumulators or broadcast values?

9) What am I forgetting in my coffee fueled happiness?

Cheers,

Holden :)

*Benchmarking has been very limited; the 2~3X improvement is likely
different for "real" workloads (unless you really like doing wordcount :p :))
** Note: the magic depends on dill.

P.S.

I leave you with this optimistic 80s-style intro screen :)
Also, if anyone happens to be going to PyData DC this weekend, I'd love to
chat with you in person about this (and of course circle it back to the
mailing list).
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-05 Thread Sean Owen
https://github.com/apache/spark/releases/tag/v2.0.1 ?

On Wed, Oct 5, 2016 at 8:06 PM Michael Gummelt 
wrote:

> There seems to be no 2.0.1 tag?
>
> https://github.com/apache/spark/tags
>
> On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin  wrote:
>
> They have been published yesterday, but can take a while to propagate.
>
>


Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-05 Thread Reynold Xin
There is now. Thanks for the email.

On Wed, Oct 5, 2016 at 12:06 PM, Michael Gummelt 
wrote:

> There seems to be no 2.0.1 tag?
>
> https://github.com/apache/spark/tags
>
> On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin  wrote:
>
>> They have been published yesterday, but can take a while to propagate.
>>
>>
>> On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar 
>> wrote:
>>
>>> Hi,
>>>
>>> It seems like the 2.0.1 artifact hasn't been published to Maven Central.
>>> Can anyone confirm?
>>>
>>> On Tue, Oct 4, 2016 at 5:39 PM, Reynold Xin  wrote:
>>>
 We are happy to announce the availability of Spark 2.0.1!

 Apache Spark 2.0.1 is a maintenance release containing 300 stability
 and bug fixes. This release is based on the branch-2.0 maintenance branch
 of Spark. We strongly recommend all 2.0.0 users to upgrade to this stable
 release.

 To download Apache Spark 2.0.1, visit http://spark.apache.org/downloads.html

 We would like to acknowledge all community members for contributing
 patches to this release.



>>>
>>>
>>> --
>>> --
>>> Cheers,
>>> Praj
>>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-05 Thread Michael Gummelt
There seems to be no 2.0.1 tag?

https://github.com/apache/spark/tags

On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin  wrote:

> They have been published yesterday, but can take a while to propagate.
>
>
> On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar 
> wrote:
>
>> Hi,
>>
>> It seems like the 2.0.1 artifact hasn't been published to Maven Central. Can
>> anyone confirm?
>>
>> On Tue, Oct 4, 2016 at 5:39 PM, Reynold Xin  wrote:
>>
>>> We are happy to announce the availability of Spark 2.0.1!
>>>
>>> Apache Spark 2.0.1 is a maintenance release containing 300 stability and
>>> bug fixes. This release is based on the branch-2.0 maintenance branch of
>>> Spark. We strongly recommend all 2.0.0 users to upgrade to this stable
>>> release.
>>>
>>> To download Apache Spark 2.0.1, visit http://spark.apache.org/downloads.html
>>>
>>> We would like to acknowledge all community members for contributing
>>> patches to this release.
>>>
>>>
>>>
>>
>>
>> --
>> --
>> Cheers,
>> Praj
>>
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: java.util.NoSuchElementException when serializing Map with default value

2016-10-05 Thread Kabeer Ahmed

Hi Jakob,

I had multiple versions of Spark installed on my machine. The code now 
works without issues in spark-shell and the IDE. I have verified this 
with Spark 1.6 and 2.0.


Cheers,
Kabeer.


On Mon, 3 Oct, 2016 at 7:30 PM, Jakob Odersky  wrote:

Hi Kabeer,

which version of Spark are you using? I can't reproduce the error in
latest Spark master.

regards,
--Jakob


On Sun, 2 Oct, 2016 at 11:39 PM, Kabeer Ahmed  
wrote:
I have had a quick look at the query from Maciej. I see one 
behaviour while running the piece of code in spark-shell and a 
different one while running it as a Spark app.


1. While running in the spark-shell, I see the serialization error 
that Maciej has reported.
2. But while running the same code as a Spark app, I see a different 
behaviour.


I have put the code below. It would be great if someone can explain 
the difference in behaviour.


Thanks,
Kabeer.


Spark-Shell:
scala> sc.stop

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark._
val sc = new SparkContext(new
  SparkConf().setAppName("bar").set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer"))

  println(sc.getConf.getOption("spark.serializer"))

  val m = Map("a" -> 1, "b" -> 2)
  val rdd5 = sc.makeRDD(Seq(m))
  println("Map RDD is: ")
  def mapFunc(input: Map[String, Int]) : Unit = 
println(input.getOrElse("a", -2))

  rdd5.map(mapFunc).collect()

// Exiting paste mode, now interpreting.

Some(org.apache.spark.serializer.KryoSerializer)
Map RDD is:
org.apache.spark.SparkException: Task not serializable
 at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
 at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
 at 
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)

-



Scenario 2:

Code:

package experiment

import org.apache.spark._

object Serialization1 extends App {

  val sc = new SparkContext(new
  SparkConf().setAppName("bar").set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
  .setMaster("local[1]")
  )

  println(sc.getConf.getOption("spark.serializer"))

  val m = Map("a" -> 1, "b" -> 2)
  val rdd5 = sc.makeRDD(Seq(m))
  println("Map RDD is: ")
  def mapFunc(input: Map[String, Int]) : Unit = 
println(input.getOrElse("a", -2))

  rdd5.map(mapFunc).collect()

}

Run command:

spark-submit --class experiment.Serialization1 
target/scala-2.10/learningspark_2.10-0.1-SNAPSHOT.jar


---




On Thu, 29 Sep, 2016 at 1:05 AM, Jakob Odersky  
wrote:

I agree with Sean's answer; you can check out the relevant serializer
here:
https://github.com/twitter/chill/blob/develop/chill-scala/src/main/scala/com/twitter/chill/Traversable.scala


On Wed, Sep 28, 2016 at 3:11 AM, Sean Owen  wrote:
 My guess is that Kryo specially handles Maps generically or relies on
 some mechanism that does, and it happens to iterate over all
 key/values as part of that and of course there aren't actually any
 key/values in the map. The Java serialization is a much more literal
 (expensive) field-by-field serialization which works here because
 there's no special treatment. I think you could register a custom
 serializer that handles this case. Or work around it in your client
 code. I know there have been other issues with Kryo and Map because,
 for example, sometimes a Map in an application is actually some
 non-serializable wrapper view.
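 As a concrete example of the client-side workaround, here is a minimal
 sketch; it simply makes the default explicit at the lookup site instead of
 relying on withDefaultValue surviving Kryo:

import org.apache.spark.{SparkConf, SparkContext}

object MapDefaultWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("map-default-workaround")
      .setMaster("local[1]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

    // No withDefaultValue: the default is supplied at the lookup site, so it
    // doesn't matter that Kryo drops the default-value wrapper.
    val aMap = Map[String, Long]("b" -> 2L)
    val first = sc.parallelize(Seq(aMap)).map(_.getOrElse("a", 0L)).first()
    println(first) // prints 0

    sc.stop()
  }
}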

 On Wed, Sep 28, 2016 at 3:18 AM, Maciej Szymkiewicz
  wrote:
 Hi everyone,

 I suspect there is no point in submitting a JIRA to fix this (not a 
Spark
 issue?) but I would like to know if this problem is documented 
anywhere.

 Somehow Kryo is losing the default value during serialization:

 scala> import org.apache.spark.{SparkContext, SparkConf}
 import org.apache.spark.{SparkContext, SparkConf}

 scala> val aMap = Map[String, Long]().withDefaultValue(0L)
 aMap: scala.collection.immutable.Map[String,Long] = Map()

 scala> aMap("a")
 res6: Long = 0

 scala> val sc = new SparkContext(new
 SparkConf().setAppName("bar").set("spark.serializer",
 "org.apache.spark.serializer.KryoSerializer"))

 scala> sc.parallelize(Seq(aMap)).map(_("a")).first
 16/09/28 09:13:47 ERROR Executor: Exception in task 2.0 in stage 2.0 
(TID 7)

 java.util.NoSuchElementException: key not found: a

 while Java serializer works just fine:

 scala> val sc = new SparkContext(new
 SparkConf().setAppName("bar").set("spark.serializer",
 "org.apache.spark.serializer.JavaSerializer"))

 scala> sc.parallelize(Seq(aMap)).map(_("a")).first
 res9: Long = 0

 --
 Best regards,
 Maciej

 --