Re: Where is DataFrame.scala in 2.0?

2016-06-03 Thread Michael Malak
It's been reduced to a single line of code.
http://technicaltidbit.blogspot.com/2016/03/dataframedataset-swap-places-in-spark-20.html
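That single line is a type alias, roughly of this shape (a sketch of the idea, not a verbatim copy of Spark's source):

// Defined in the org.apache.spark.sql package object in 2.0
type DataFrame = Dataset[Row]

With that alias in place, a separate DataFrame.scala is no longer needed.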




  From: Gerhard Fiedler 
 To: "dev@spark.apache.org"  
 Sent: Friday, June 3, 2016 9:01 AM
 Subject: Where is DataFrame.scala in 2.0?
   
When I look at the sources in Github, I see DataFrame.scala at
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
in the 1.6 branch. But when I change the branch to branch-2.0 or master, I get
a 404 error. I also can’t find the file in the directory listings, for example
https://github.com/apache/spark/tree/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql
(for branch-2.0).

It seems that quite a few APIs use the DataFrame class, even in 2.0. Can
someone please point me to its location, or otherwise explain why it is not
there?

Thanks, Gerhard

  

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Michael Malak
I see you've been burning the midnight oil.

  From: Reynold Xin 
 To: "dev@spark.apache.org"  
 Sent: Friday, April 1, 2016 1:15 AM
 Subject: [discuss] using deep learning to improve Spark
   
Hi all,
Hope you all enjoyed the Tesla 3 unveiling earlier tonight.
I'd like to bring your attention to a project called DeepSpark that we have 
been working on for the past three years. We realized that scaling software 
development was challenging. A large fraction of software engineering has been 
manual and mundane: writing test cases, fixing bugs, implementing features 
according to specs, and reviewing pull requests. So we started this project to 
see how much we could automate.
After three years of development and one year of testing, we now have enough 
confidence that this could work well in practice. For example, Matei confessed 
to me today: "It looks like DeepSpark has a better understanding of Spark 
internals than I ever will. It updated several pieces of code I wrote long ago 
that even I no longer understood."

I think it's time to discuss as a community about how we want to continue this 
project to ensure Spark is stable, secure, and easy to use yet able to progress 
as fast as possible. I'm still working on a more formal design doc, and it 
might take a little bit more time since I haven't been able to fully grasp 
DeepSpark's capabilities yet. Based on my understanding right now, I've written 
a blog post about DeepSpark here: 
https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html

Please take a look and share your thoughts. Obviously, this is an ambitious 
project and could take many years to fully implement. One major challenge is 
cost. The current Spark Jenkins infrastructure provided by the AMPLab has only 
8 machines, but DeepSpark uses 12000 machines. I'm not sure whether AMPLab or 
Databricks can fund DeepSpark's operation for a long period of time. Perhaps 
AWS can help out here. Let me know if you have other ideas.







  

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Michael Malak
Would it make sense (in terms of feasibility, code organization, and politics)
to have a JavaDataFrame, as a way to isolate the 1000+ extra lines
to a Java compatibility layer/class?

  From: Reynold Xin 
 To: "dev@spark.apache.org"  
 Sent: Thursday, February 25, 2016 4:23 PM
 Subject: [discuss] DataFrame vs Dataset in Spark 2.0
   
When we first introduced Dataset in 1.6 as an experimental API, we wanted to 
merge Dataset/DataFrame but couldn't because we didn't want to break the 
pre-existing DataFrame API (e.g. map function should return Dataset, rather 
than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and 
Dataset.
Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways 
to implement this:
Option 1. Make DataFrame a type alias for Dataset[Row]
Option 2. DataFrame as a concrete class that extends Dataset[Row]

I'm wondering what you think about this. The pros and cons I can think of are:

Option 1. Make DataFrame a type alias for Dataset[Row]
+ Cleaner conceptually, especially in Scala. It will be very clear what
libraries or applications need to do, and we won't see type mismatches (e.g. a
function expects DataFrame, but the user is passing in Dataset[Row]).
+ A lot less code.
- Breaks source compatibility for the DataFrame API in Java, and binary
compatibility for Scala/Java.

Option 2. DataFrame as a concrete class that extends Dataset[Row]
The pros/cons are basically the inverse of Option 1.
+ In most cases, can maintain source compatibility for the DataFrame API in
Java, and binary compatibility for Scala/Java.
- A lot more code (1000+ loc).
- Less clean, and can be confusing when users pass a Dataset[Row] into a
function that expects a DataFrame.
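To make the trade-off concrete, here is a toy Scala sketch of the two shapes being compared (names are stand-ins, not Spark's actual classes):

// Minimal stand-ins so the sketch compiles on its own
class Dataset[T](val data: Seq[T])
case class Row(values: Seq[Any])

// Option 1: DataFrame is only an alias; any Dataset[Row] *is* a DataFrame,
// but the old DataFrame class (and its binary signature) disappears.
object Option1 {
  type DataFrame = Dataset[Row]
}

// Option 2: DataFrame stays a real class; old signatures survive, but every
// DataFrame-returning method needs forwarding/override code (the 1000+ loc).
object Option2 {
  class DataFrame(data: Seq[Row]) extends Dataset[Row](data)
}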

The concerns are mostly with Scala/Java. For Python, it is very easy to 
maintain source compatibility for both (there is no concept of binary 
compatibility), and for R, we are only supporting the DataFrame operations
anyway because that's a more familiar interface for R users outside of Spark.



  

Wrong initial bias in GraphX SVDPlusPlus?

2015-04-03 Thread Michael Malak
I believe the initialization of biases in GraphX SVDPlusPlus is incorrect.
Specifically, in line
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96
 
instead of 
(vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1)) 
it should be 
(vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / scala.math.sqrt(msg.get._1)) 

That is, the biases bu and bi (both represented as the third component of the 
Tuple4[] above, depending on whether the vertex is a user or an item), 
described in equation (1) of the Koren paper, are supposed to be small offsets 
to the mean (represented by the variable u, signifying the Greek letter mu) to 
account for peculiarities of individual users and items. 
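Written out, equation (1) of the Koren paper (as I read it) is the baseline estimate

    b_ui = u + b_u + b_i

where u is the global mean rating and b_u, b_i are the small per-user and per-item offsets; initializing the third tuple component to the raw per-vertex mean rather than to mean - u would effectively count u twice.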

Initializing these biases to the wrong values should theoretically not matter
given enough iterations of the algorithm, but some quick empirical testing
shows it has trouble converging at all, even with many orders of magnitude
more iterations.

This perhaps could be the source of previously reported trouble with 
SVDPlusPlus. 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-SVDPlusPlus-problem-td12885.html
 

If after a day, no one tells me I'm crazy here, I'll go ahead and create a Jira 
ticket. 




textFile() ordering and header rows

2015-02-22 Thread Michael Malak
Since RDDs are generally unordered, is textFile().first() even guaranteed to
return the first row of the file (such as when looking for a header row)? If
not, doesn't that make the example in
http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?
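For what it's worth, the workaround usually suggested is to drop the first line of the first partition explicitly; a sketch (assuming the header really does land in partition 0, which holds for a single-file read):

val lines = sc.textFile("data.csv")
// Drop the header from partition 0 only; every other partition passes through.
val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}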




Word2Vec IndexedRDD

2015-02-01 Thread Michael Malak
1. Is IndexedRDD planned for 1.3? 
https://issues.apache.org/jira/browse/SPARK-2365

2. Once IndexedRDD is in, is it planned to convert Word2VecModel to it from its 
current Map[String,Array[Float]]? 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L425




Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Michael Malak
I personally have no preference between DataFrame and DataTable; I only wish
to lay out the history and etymology, simply because I'm into that sort of
thing.

"Frame" comes from Marvin Minsky's 1970s AI construct: slots and the data
that go in them. The S programming language (precursor to R) adopted this
terminology in 1991. R of course became popular with the rise of Data Science
around 2012.
http://www.google.com/trends/explore#q=%22data%20science%22%2C%20%22r%20programming%22cmpt=qtz=

DataFrame would carry the implication that it comes along with its own 
metadata, whereas DataTable might carry the implication that metadata is 
stored in a central metadata repository.

DataFrame is thus technically more correct for SchemaRDD, but it is a less
familiar (and thus less accessible) term for those not immersed in data
science or AI, and so may have narrower appeal.


- Original Message -
From: Evan R. Sparks evan.spa...@gmail.com
To: Matei Zaharia matei.zaha...@gmail.com
Cc: Koert Kuipers ko...@tresata.com; Michael Malak michaelma...@yahoo.com; 
Patrick Wendell pwend...@gmail.com; Reynold Xin r...@databricks.com; 
dev@spark.apache.org dev@spark.apache.org
Sent: Tuesday, January 27, 2015 9:55 AM
Subject: Re: renaming SchemaRDD - DataFrame

I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface can be abstract and the implementation (in this
case, SparkSQL backend) could be swapped out.

One alternative suggestion on the name - why not call it DataTable?
DataFrame seems like a name carried over from pandas (and by extension, R),
and it's never been obvious to me what a Frame is.



On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 (Actually when we designed Spark SQL we thought of giving it another name,
 like Spark Schema, but we decided to stick with SQL since that was the most
 obvious use case to many users.)

 Matei

  On Jan 26, 2015, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
  While it might be possible to move this concept to Spark Core long-term,
 supporting structured data efficiently does require quite a bit of the
 infrastructure in Spark SQL, such as query planning and columnar storage.
 The intent of Spark SQL though is to be more than a SQL server -- it's
 meant to be a library for manipulating structured data. Since this is
 possible to build over the core API, it's pretty natural to organize it
 that way, same as Spark Streaming is a library.
 
  Matei
 
  On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote:
 
  The context is that SchemaRDD is becoming a common data format used for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API.
 
  i agree. this to me also implies it belongs in spark core, not sql
 
  On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
  michaelma...@yahoo.com.invalid wrote:
 
  And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
 Area
  Spark Meetup YouTube contained a wealth of background information on
 this
  idea (mostly from Patrick and Reynold :-).
 
  https://www.youtube.com/watch?v=YWppYPWznSQ
 
  
  From: Patrick Wendell pwend...@gmail.com
  To: Reynold Xin r...@databricks.com
  Cc: dev@spark.apache.org dev@spark.apache.org
  Sent: Monday, January 26, 2015 4:01 PM
  Subject: Re: renaming SchemaRDD - DataFrame
 
 
  One thing potentially not clear from this e-mail, there will be a 1:1
  correspondence where you can get an RDD to/from a DataFrame.
 
 
  On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com
 wrote:
  Hi,
 
  We are considering renaming SchemaRDD - DataFrame in 1.3, and wanted
 to
  get the community's opinion.
 
  The context is that SchemaRDD is becoming a common data format used
 for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API. We also expect
 more
  and
  more users to be programming directly against SchemaRDD API rather
 than
  the
  core RDD API. SchemaRDD, through its less commonly used DSL originally
  designed for writing test cases, always has the data-frame like API.
 In
  1.3, we are redesigning the API to make the API usable for end users.
 
 
  There are two motivations for the renaming:
 
  1. DataFrame seems to be a more self-evident name than SchemaRDD.
 
  2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
 (even
  though it would contain some RDD functions like map, flatMap, etc),
 and
  calling it Schema*RDD* while it is not an RDD is highly confusing.
   Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
 
 
  My understanding is that very few users program directly against the
  SchemaRDD API at the moment, because they are not well documented.
  However,
  to maintain backward compatibility, we can

Re: GraphX ShortestPaths backwards?

2015-01-20 Thread Michael Malak
I created https://issues.apache.org/jira/browse/SPARK-5343 for this.


- Original Message -
From: Michael Malak michaelma...@yahoo.com
To: dev@spark.apache.org dev@spark.apache.org
Cc: 
Sent: Monday, January 19, 2015 5:09 PM
Subject: GraphX ShortestPaths backwards?

GraphX ShortestPaths seems to be following edges backwards instead of forwards:

import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))),
sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))

lib.ShortestPaths.run(g,Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId,
org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3
-> 0)), (2,Map()))

lib.ShortestPaths.run(g,Array(1)).vertices.collect

res2: Array[(org.apache.spark.graphx.VertexId,
org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)),
(3,Map(1 -> 2)), (2,Map(1 -> 1)))

If I am not mistaken about my assessment, then I believe the following changes 
will make it run forward:

Change one occurrence of src to dst in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64

Change three occurrences of dst to src in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65
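In the meantime, a possible workaround (my sketch, not something from the thread) is to run on the reversed graph, so that distances are accumulated along the original edge directions:

// If ShortestPaths follows edges in reverse, running it on g.reverse should
// report distances reachable from landmark 1 by following edges forwards.
lib.ShortestPaths.run(g.reverse, Array(1)).vertices.collect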




GraphX ShortestPaths backwards?

2015-01-19 Thread Michael Malak
GraphX ShortestPaths seems to be following edges backwards instead of forwards:

import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))),
sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))

lib.ShortestPaths.run(g,Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId,
org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3
-> 0)), (2,Map()))

lib.ShortestPaths.run(g,Array(1)).vertices.collect

res2: Array[(org.apache.spark.graphx.VertexId,
org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)),
(3,Map(1 -> 2)), (2,Map(1 -> 1)))

If I am not mistaken about my assessment, then I believe the following changes 
will make it run forward:

Change one occurrence of src to dst in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64

Change three occurrences of dst to src in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65




Re: GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
But wouldn't the gain be greater under something similar to EdgePartition1D 
(but perhaps better load-balanced based on number of edges for each vertex) and 
an algorithm that primarily follows edges in the forward direction?
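For readers following along, edge placement in GraphX is chosen explicitly via partitionBy; a minimal sketch of what using EdgePartition1D looks like (toy graph, empty attributes, illustrative only):

import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L, ""), (2L, ""), (3L, ""))),
  sc.makeRDD(Array(Edge(1L, 2L, ""), Edge(1L, 3L, ""))))
// EdgePartition1D groups edges by srcId; vertices themselves are still
// hash-partitioned independently, per Ankur's reply below.
val bySrc = g.partitionBy(PartitionStrategy.EdgePartition1D)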
  From: Ankur Dave ankurd...@gmail.com
 To: Michael Malak michaelma...@yahoo.com 
Cc: dev@spark.apache.org dev@spark.apache.org 
 Sent: Monday, January 19, 2015 2:08 PM
 Subject: Re: GraphX vertex partition/location strategy
   
No - the vertices are hash-partitioned onto workers independently of the edges. 
It would be nice for each vertex to be on the worker with the most adjacent 
edges, but we haven't done this yet since it would add a lot of complexity to 
avoid load imbalance while reducing the overall communication by a small factor.
We refer to the number of partitions containing adjacent edges for a particular 
vertex as the vertex's replication factor. I think the typical replication 
factor for power-law graphs with 100-200 partitions is 10-15, and placing the 
vertex at the ideal location would only reduce the replication factor by 1.

Ankur


On Mon, Jan 19, 2015 at 12:20 PM, Michael Malak 
michaelma...@yahoo.com.invalid wrote:

Does GraphX make an effort to co-locate vertices onto the same workers as the 
majority (or even some) of its edges?



   

GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
Does GraphX make an effort to co-locate vertices onto the same workers as the 
majority (or even some) of its edges?




GraphX doc: triangleCount() requirement overstatement?

2015-01-18 Thread Michael Malak
According to:
https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting
 

Note that TriangleCount requires the edges to be in canonical orientation 
(srcId < dstId)

But isn't this overstating the requirement? Isn't the requirement really that 
IF there are duplicate edges between two vertices, THEN those edges must all be 
in the same direction (in order for the groupEdges() at the beginning of 
triangleCount() to produce the intermediate results that triangleCount() 
expects)?

If so, should I enter a JIRA ticket to clarify the documentation?

Or is it the case that https://issues.apache.org/jira/browse/SPARK-3650 will 
make it into Spark 1.3 anyway?
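For reference, the documented requirement is usually satisfied by canonicalizing edge direction up front; a sketch (toy graph, illustrative only):

import org.apache.spark.graphx._
val raw = Graph(
  sc.makeRDD(Array((1L, ""), (2L, ""), (3L, ""))),
  sc.makeRDD(Array(Edge(2L, 1L, ""), Edge(2L, 3L, ""), Edge(3L, 1L, ""))))
// Swap endpoints where needed so every edge has srcId < dstId, then count.
val canonical = Graph(raw.vertices,
  raw.edges.map(e => if (e.srcId < e.dstId) e else Edge(e.dstId, e.srcId, e.attr)))
val triangles = canonical.partitionBy(PartitionStrategy.RandomVertexCut).triangleCount()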




Re: GraphX rmatGraph hangs

2015-01-04 Thread Michael Malak
Thank you. I created 
https://issues.apache.org/jira/browse/SPARK-5064


- Original Message -
From: xhudik xhu...@gmail.com
To: dev@spark.apache.org
Cc: 
Sent: Saturday, January 3, 2015 2:04 PM
Subject: Re: GraphX rmatGraph hangs

Hi Michael,

yes, I can confirm the behavior.
It gets stuck (a loop?) and eats all resources; the top command gives:
14013 ll 20   0 2998136 489992  19804 S 100.2 12.10  13:29.39 java

You might create an issue/bug in jira
(https://issues.apache.org/jira/browse/SPARK)


best, tomas








GraphX rmatGraph hangs

2015-01-03 Thread Michael Malak
The following single line just hangs when executed in either the Spark Shell
or a standalone program:

org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8)

It just outputs 0 edges and then locks up.
The only other information I've found via Google is:

http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408617621830-12570.p...@n3.nabble.com%3E




15 new MLlib algorithms

2014-07-09 Thread Michael Malak
At Spark Summit, Patrick Wendell indicated the number of MLlib algorithms would 
roughly double in 1.1 from the current approx. 15.
http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf

What are the planned additional algorithms?

In JIRA, I see only two when filtering on version 1.1 and component MLlib: one
on multi-label and another on high dimensionality.

https://issues.apache.org/jira/browse/SPARK-2329?jql=issuetype%20in%20(Brainstorming%2C%20Epic%2C%20%22New%20Feature%22%2C%20Story)%20AND%20fixVersion%20%3D%201.1.0%20AND%20component%20%3D%20MLlib

http://tinyurl.com/ku7sehu


GraphX triplets on 5-node graph

2014-05-29 Thread Michael Malak
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I 
missing something fundamental?


val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L,
"N4"), (5L, "N5")))
val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"),
Edge(2L, 4L, "E3"), Edge(3L, 5L, "E4")))
Graph(nodes, edges).triplets.collect
res1: Array[org.apache.spark.graphx.EdgeTriplet[String,String]] =
Array(((1,N1),(3,N3),E2), ((1,N1),(3,N3),E2), ((3,N3),(5,N5),E4),
((3,N3),(5,N5),E4))


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Michael Malak
While developers may appreciate 1.0 == API stability, I'm not sure that will 
be the understanding of the VP who gives the green light to a Spark-based 
development effort.

I fear a bug that silently produces erroneous results will be perceived like 
the FDIV bug, but in this case without the momentum of an existing large 
installed base and with a number of competitors (GridGain, H2O,
Stratosphere). Despite the stated intention of API stability, the perception 
(which becomes the reality) of 1.0 is that it's ready for production use -- 
not bullet-proof, but also not with known silent generation of erroneous 
results. Exceptions and crashes are much more tolerated than silent corruption 
of data. The result may be a reputation of the Spark team as being unconcerned
about data integrity.

I ran into (and submitted) https://issues.apache.org/jira/browse/SPARK-1817 due 
to the lack of zipWithIndex(). zip() with a self-created partitioned range was 
the way I was trying to number with IDs a collection of nodes in preparation 
for the GraphX constructor. For the record, it was a frequent Spark committer 
who escalated it to blocker; I did not submit it as such. Partitioning a 
Scala range isn't just a toy example; it has a real-life use.
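For context, the numbering pattern being described looks roughly like this once zipWithIndex() is available (names are illustrative):

// Assign a stable Long ID to each node attribute, e.g. to build GraphX vertices.
val names = sc.parallelize(Seq("alice", "bob", "carol"))
val vertices = names.zipWithIndex().map { case (name, id) => (id, name) }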

I also wonder about the REPL. Cloudera, for example, touts it as key to making 
Spark a crossover tool that Data Scientists can also use. The REPL can be 
considered an API of sorts -- not a traditional Scala or Java API, of course, 
but the API that a human data analyst would use. With the Scala REPL 
exhibiting some of the same bad behaviors as the Spark REPL, there is a 
question of whether the Spark REPL can even be fixed. If the Spark REPL has to 
be eliminated after 1.0 due to an inability to repair it, that would constitute 
API instability.


 
On Saturday, May 17, 2014 2:49 PM, Matei Zaharia matei.zaha...@gmail.com 
wrote:
 
As others have said, the 1.0 milestone is about API stability, not about saying 
“we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can 
confidently build on Spark, knowing that the application they build today will 
still run on Spark 1.9.9 three years from now. This is something that I’ve seen 
done badly (and experienced the effects thereof) in other big data projects, 
such as MapReduce and even YARN. The result is that you annoy users, you end up 
with a fragmented userbase where everyone is building against a different 
version, and you drastically slow down development.

With a project as fast-growing as Spark in particular, there
will be new bugs discovered and reported continuously, especially in the 
non-core components. Look at the graph of # of contributors in time to Spark: 
https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits” changed when 
we started merging each patch as a single commit). This is not slowing down, 
and we need to have the culture now that we treat API stability and release 
numbers at the level expected for a 1.0 project instead of having people come 
in and randomly change the API.

I’ll also note that the issues marked “blocker” were marked so by their 
reporters, since the reporter can set the priority. I don’t consider stuff like 
parallelize() not partitioning ranges in the same way as other collections a 
blocker — it’s a bug, it would be good to fix it, but it only affects a small 
number of use cases. Of course if we find a real blocker (in particular a 
regression from a previous version, or a feature that’s just completely 
broken), we will delay the release for that, but at some point you have to say 
“okay, this fix will go into the next maintenance release”. Maybe we need to 
write a clear policy for what the issue priorities mean.

Finally, I believe it’s much better to have a culture where you can make 
releases on a regular schedule, and have the option to make a maintenance 
release in 3-4 days if you find new bugs, than one where you pile up stuff into 
each release. This is what much larger projects than us, like Linux, do, and it’s
the only way to avoid indefinite stalling with a large contributor base. In the 
worst case, if you find a new bug that warrants immediate release, it goes into 
1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug 
fix in it). And if you find an API that you’d like to improve, just add a new 
one and maybe deprecate the old one — at some point we have to respect our 
users and let them know that code they write today will still run tomorrow.

Matei


On May 17, 2014, at 10:32 AM, Kan Zhang kzh...@apache.org wrote:

 +1 on the running commentary here, non-binding of course :-)
 
 
 On Sat, May 17, 2014 at 8:44 AM, Andrew Ash and...@andrewash.com wrote:
 
 +1 on the next release feeling more like a 0.10 than a 1.0
 On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com wrote:
 
 I had echoed similar sentiments a while back when there was a discussion
 around 0.10 vs 1.0 ... I would have 

Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Reposting here on dev since I didn't see a response on user:

I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In 
the Spark Shell, equals() fails when I use the canonical equals() pattern of 
match{}, but works when I substitute with isInstanceOf[]. I am using Spark
0.9.0/Scala 2.10.3.

Is this a bug?

Spark Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = false

Spark Shell (equals uses isInstanceOf[])


class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == 
s) else false
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true

Scala Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true


Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Thank you for your investigation into this!

Just for completeness, I've confirmed it's a problem only in REPL, not in 
compiled Spark programs.

But within the REPL, a direct consequence of the classes not being the same
after serialization/deserialization is that lookup() doesn't work either:

scala> class C(val s:String) extends Serializable {
     |   override def equals(o: Any) = if (o.isInstanceOf[C])
o.asInstanceOf[C].s == s else false
     |   override def toString = s
     | }
defined class C

scala> val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
r: org.apache.spark.rdd.RDD[(C, Int)] = ParallelCollectionRDD[3] at parallelize
at <console>:14

scala> r.lookup(new C("a"))
<console>:17: error: type mismatch;
 found   : C
 required: C
  r.lookup(new C("a"))
           ^



On Tuesday, May 13, 2014 3:05 PM, Anand Avati av...@gluster.org wrote:

On Tue, May 13, 2014 at 8:26 AM, Michael Malak michaelma...@yahoo.com wrote:

Reposting here on dev since I didn't see a response on user:

I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In 
the Spark Shell, equals() fails when I use the canonical equals() pattern of 
match{}, but works when I substitute with isInstanceOf[]. I am using Spark
0.9.0/Scala 2.10.3.

Is this a bug?

Spark Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = false

Spark Shell (equals uses isInstanceOf[])


class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == 
s) else false
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true

Scala Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true



Hmm. I see that this can be reproduced without Spark in Scala 2.11, with and 
without -Yrepl-class-based command line flag to the repl. Spark's REPL has the 
effective behavior of 2.11's -Yrepl-class-based flag. Inspecting the byte code 
generated, it appears -Yrepl-class-based results in the creation of $outer 
field in the generated classes (including class C). The first case match in 
equals() results in code along the lines of (simplified):

if (o isinstanceof C && this.$outer == that.$outer) { // do string compare // }

$outer is the synthetic field referencing the outer object in which the object
was created (in this case, the REPL environment). Now obviously, when x is
taken through the byte stream and deserialized, it has a new $outer created (it
may have been deserialized in a different JVM or machine for all we know). So
the $outer mismatch is expected. However, I'm still trying to understand why
the outers need to be the same for the case match.