Re: Where is DataFrame.scala in 2.0?
It's been reduced to a single line of code. http://technicaltidbit.blogspot.com/2016/03/dataframedataset-swap-places-in-spark-20.html
From: Gerhard Fiedler To: "dev@spark.apache.org" Sent: Friday, June 3, 2016 9:01 AM Subject: Where is DataFrame.scala in 2.0? When I look at the sources in Github, I see DataFrame.scala at https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala in the 1.6 branch. But when I change the branch to branch-2.0 or master, I get a 404 error. I also can't find the file in the directory listings, for example https://github.com/apache/spark/tree/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql (for branch-2.0). It seems that quite a few APIs use the DataFrame class, even in 2.0. Can someone please point me to its location, or otherwise explain why it is not there? Thanks, Gerhard
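For anyone else hunting for the file: there is no DataFrame.scala in branch-2.0 at all. The alias lives in the org.apache.spark.sql package object, roughly (paraphrasing the 2.0 source, not quoting it exactly):

package object sql {
  type DataFrame = Dataset[Row]
}

so every 2.0 API that is documented as taking or returning a DataFrame is really working with a Dataset[Row].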
Re: [discuss] using deep learning to improve Spark
I see you've been burning the midnight oil.
From: Reynold Xin To: "dev@spark.apache.org" Sent: Friday, April 1, 2016 1:15 AM Subject: [discuss] using deep learning to improve Spark Hi all, Hope you all enjoyed the Tesla 3 unveiling earlier tonight. I'd like to bring your attention to a project called DeepSpark that we have been working on for the past three years. We realized that scaling software development was challenging. A large fraction of software engineering has been manual and mundane: writing test cases, fixing bugs, implementing features according to specs, and reviewing pull requests. So we started this project to see how much we could automate. After three years of development and one year of testing, we now have enough confidence that this could work well in practice. For example, Matei confessed to me today: "It looks like DeepSpark has a better understanding of Spark internals than I ever will. It updated several pieces of code I wrote long ago that even I no longer understood." I think it's time to discuss as a community how we want to continue this project to ensure Spark is stable, secure, and easy to use yet able to progress as fast as possible. I'm still working on a more formal design doc, and it might take a little bit more time since I haven't been able to fully grasp DeepSpark's capabilities yet. Based on my understanding right now, I've written a blog post about DeepSpark here: https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html Please take a look and share your thoughts. Obviously, this is an ambitious project and could take many years to fully implement. One major challenge is cost. The current Spark Jenkins infrastructure provided by the AMPLab has only 8 machines, but DeepSpark uses 12000 machines. I'm not sure whether AMPLab or Databricks can fund DeepSpark's operation for a long period of time. Perhaps AWS can help out here. Let me know if you have other ideas.
Re: [discuss] DataFrame vs Dataset in Spark 2.0
Would it make sense (in terms of feasibility, code organization, and politics) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines to a Java compatibility layer/class?
From: Reynold Xin To: "dev@spark.apache.org" Sent: Thursday, February 25, 2016 4:23 PM Subject: [discuss] DataFrame vs Dataset in Spark 2.0
When we first introduced Dataset in 1.6 as an experimental API, we wanted to merge Dataset/DataFrame but couldn't because we didn't want to break the pre-existing DataFrame API (e.g. the map function should return Dataset rather than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and Dataset. Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways to implement this: Option 1. Make DataFrame a type alias for Dataset[Row]. Option 2. Make DataFrame a concrete class that extends Dataset[Row]. I'm wondering what you think about this. The pros and cons I can think of are:
Option 1. Make DataFrame a type alias for Dataset[Row]
+ Cleaner conceptually, especially in Scala. It will be very clear what libraries or applications need to do, and we won't see type mismatches (e.g. a function expects DataFrame, but the user is passing in Dataset[Row]).
+ A lot less code.
- Breaks source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java.
Option 2. DataFrame as a concrete class that extends Dataset[Row]
The pros/cons are basically the inverse of Option 1.
+ In most cases, can maintain source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java.
- A lot more code (1000+ loc).
- Less clean, and can be confusing when users pass a Dataset[Row] into a function that expects a DataFrame.
The concerns are mostly with Scala/Java. For Python, it is very easy to maintain source compatibility for both (there is no concept of binary compatibility), and for R, we are only supporting the DataFrame operations anyway because that's a more familiar interface for R users outside of Spark.
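To make the JavaDataFrame suggestion concrete, here is the rough shape I have in mind -- purely a hypothetical sketch, with made-up names and signatures, not anything that exists today:

// Scala (and Python/R) users get the clean alias of Option 1:
type DataFrame = Dataset[Row]

// The 1000+ lines of Java source compatibility live in a separate wrapper
// rather than in a DataFrame subclass:
class JavaDataFrame(val ds: Dataset[Row]) {
  def count(): Long = ds.count()
  def select(col: String, cols: String*): JavaDataFrame =
    new JavaDataFrame(ds.select(col, cols: _*))
  // ...remaining DataFrame methods delegate to ds and wrap the result
}

That way Option 1's breakage is confined to Java callers, who would switch to the wrapper, rather than forcing a concrete DataFrame class on everyone.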
Wrong initial bias in GraphX SVDPlusPlus?
I believe that in the initialization portion of GraphX SVDPlusPlus, the initialization of biases is incorrect. Specifically, in line https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96 instead of (vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1)) it should be (vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / scala.math.sqrt(msg.get._1)) That is, the biases bu and bi (both represented as the third component of the Tuple4[] above, depending on whether the vertex is a user or an item), described in equation (1) of the Koren paper, are supposed to be small offsets from the mean (represented by the variable u, signifying the Greek letter mu) to account for peculiarities of individual users and items. Initializing these biases to wrong values should theoretically not matter given enough iterations of the algorithm, but some quick empirical testing shows it has trouble converging at all, even after many orders of magnitude additional iterations. This perhaps could be the source of previously reported trouble with SVDPlusPlus. http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-SVDPlusPlus-problem-td12885.html If after a day no one tells me I'm crazy here, I'll go ahead and create a Jira ticket.
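Spelled out, with u being the global mean rating the algorithm computes earlier (per equation (1) of the Koren paper the baseline estimate is b_ui = u + b_u + b_i, so initializing a bias to the raw per-vertex average, which already contains u, effectively counts u twice in the prediction):

// current line 96:
(vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1))
// proposed: store the bias as an offset from the mean
(vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / scala.math.sqrt(msg.get._1))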
textFile() ordering and header rows
Since RDDs are generally unordered, is textFile().first() even guaranteed to return the first row of the file (such as when looking for a header row)? If not, doesn't that make the example in http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?
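For what it's worth, the workaround I've been using to strip a header without leaning on any ordering guarantee from first() is a sketch like this ("data.csv" is just a placeholder path, and it assumes the header is the first line of the first partition, which is how textFile() splits a single file):

val lines = sc.textFile("data.csv")
val data = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter  // drop the header from partition 0 only
}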
Word2Vec IndexedRDD
1. Is IndexedRDD planned for 1.3? https://issues.apache.org/jira/browse/SPARK-2365 2. Once IndexedRDD is in, is it planned to convert Word2VecModel to it from its current Map[String,Array[Float]]? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L425
Re: renaming SchemaRDD -> DataFrame
I personally have no preference between DataFrame and DataTable, but only wish to lay out the history and etymology simply because I'm into that sort of thing. Frame comes from Marvin Minsky's 1970s AI construct: slots and the data that go in them. The S programming language (precursor to R) adopted this terminology in 1991. R of course became popular with the rise of Data Science around 2012. http://www.google.com/trends/explore#q=%22data%20science%22%2C%20%22r%20programming%22&cmpt=q&tz= DataFrame would carry the implication that it comes along with its own metadata, whereas DataTable might carry the implication that metadata is stored in a central metadata repository. DataFrame is thus technically more correct for SchemaRDD, but is a less familiar (and thus less accessible) term for those not immersed in data science or AI and thus may have narrower appeal.
- Original Message - From: Evan R. Sparks evan.spa...@gmail.com To: Matei Zaharia matei.zaha...@gmail.com Cc: Koert Kuipers ko...@tresata.com; Michael Malak michaelma...@yahoo.com; Patrick Wendell pwend...@gmail.com; Reynold Xin r...@databricks.com; dev@spark.apache.org dev@spark.apache.org Sent: Tuesday, January 27, 2015 9:55 AM Subject: Re: renaming SchemaRDD -> DataFrame
I'm +1 on this, although a little worried about unknowingly introducing SparkSQL dependencies every time someone wants to use this. It would be great if the interface can be abstract and the implementation (in this case, SparkSQL backend) could be swapped out. One alternative suggestion on the name - why not call it DataTable? DataFrame seems like a name carried over from pandas (and by extension, R), and it's never been obvious to me what a Frame is.
On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: (Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.) Matei
On Jan 26, 2015, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL though is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library. Matei
On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote: The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. i agree. this to me also implies it belongs in spark core, not sql
On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak michaelma...@yahoo.com.invalid wrote: And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube contained a wealth of background information on this idea (mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ
From: Patrick Wendell pwend...@gmail.com To: Reynold Xin r...@databricks.com Cc: dev@spark.apache.org dev@spark.apache.org Sent: Monday, January 26, 2015 4:01 PM Subject: Re: renaming SchemaRDD -> DataFrame One thing potentially not clear from this e-mail, there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.
On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com wrote: Hi, We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion. The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had a data-frame-like API. In 1.3, we are redesigning the API to make the API usable for end users. There are two motivations for the renaming: 1. DataFrame seems to be a more self-evident name than SchemaRDD. 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods. My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can
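For what it's worth, my reading of the proposal from the application side would look roughly like this (illustrative names only; this is the proposed 1.3 shape, not a released API):

val df: DataFrame = sqlContext.table("people")  // data-frame style operations
val rdd: RDD[Row] = df.rdd                      // escape hatch to the core RDD API
rdd.map(_.getString(0)).distinct().count()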
Re: GraphX ShortestPaths backwards?
I created https://issues.apache.org/jira/browse/SPARK-5343 for this.
- Original Message - From: Michael Malak michaelma...@yahoo.com To: dev@spark.apache.org dev@spark.apache.org Cc: Sent: Monday, January 19, 2015 5:09 PM Subject: GraphX ShortestPaths backwards?
GraphX ShortestPaths seems to be following edges backwards instead of forwards:
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
lib.ShortestPaths.run(g,Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))
lib.ShortestPaths.run(g,Array(1)).vertices.collect
res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))
If I am not mistaken about my assessment, then I believe the following changes will make it run forward: Change one occurrence of src to dst in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64 Change three occurrences of dst to src in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65
GraphX ShortestPaths backwards?
GraphX ShortestPaths seems to be following edges backwards instead of forwards:
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
lib.ShortestPaths.run(g,Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))
lib.ShortestPaths.run(g,Array(1)).vertices.collect
res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))
If I am not mistaken about my assessment, then I believe the following changes will make it run forward: Change one occurrence of src to dst in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64 Change three occurrences of dst to src in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65
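If the behavior really is reversed, a workaround in the meantime (untested sketch, reusing g from above) should be to reverse the edge directions before running it:

lib.ShortestPaths.run(g.reverse, Array(3)).vertices.collect

which ought to report the forward-direction distance from each vertex to vertex 3.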
Re: GraphX vertex partition/location strategy
But wouldn't the gain be greater under something similar to EdgePartition1D (but perhaps better load-balanced based on number of edges for each vertex) and an algorithm that primarily follows edges in the forward direction? From: Ankur Dave ankurd...@gmail.com To: Michael Malak michaelma...@yahoo.com Cc: dev@spark.apache.org dev@spark.apache.org Sent: Monday, January 19, 2015 2:08 PM Subject: Re: GraphX vertex partition/location strategy No - the vertices are hash-partitioned onto workers independently of the edges. It would be nice for each vertex to be on the worker with the most adjacent edges, but we haven't done this yet since it would add a lot of complexity to avoid load imbalance while reducing the overall communication by a small factor. We refer to the number of partitions containing adjacent edges for a particular vertex as the vertex's replication factor. I think the typical replication factor for power-law graphs with 100-200 partitions is 10-15, and placing the vertex at the ideal location would only reduce the replication factor by 1. Ankur On Mon, Jan 19, 2015 at 12:20 PM, Michael Malak michaelma...@yahoo.com.invalid wrote: Does GraphX make an effort to co-locate vertices onto the same workers as the majority (or even some) of its edges?
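Concretely, what I am picturing is something like this (sketch only; graph stands for an already-built Graph[VD, ED]):

import org.apache.spark.graphx._
// Co-locate edges by source vertex so a forward-following algorithm
// finds most of a vertex's out-edges in one partition:
val bySource = graph.partitionBy(PartitionStrategy.EdgePartition1D)

with the caveat from above that EdgePartition1D alone can load-balance badly on power-law graphs, so some edge-count-aware variant would be needed.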
GraphX vertex partition/location strategy
Does GraphX make an effort to co-locate vertices onto the same workers as the majority (or even some) of its edges?
GraphX doc: triangleCount() requirement overstatement?
According to: https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting "Note that TriangleCount requires the edges to be in canonical orientation (srcId < dstId)" But isn't this overstating the requirement? Isn't the requirement really that IF there are duplicate edges between two vertices, THEN those edges must all be in the same direction (in order for the groupEdges() at the beginning of triangleCount() to produce the intermediate results that triangleCount() expects)? If so, should I enter a JIRA ticket to clarify the documentation? Or is it the case that https://issues.apache.org/jira/browse/SPARK-3650 will make it into Spark 1.3 anyway?
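If it helps, forcing canonical orientation before counting is straightforward enough (untested sketch; graph is an already-built Graph[VD, ED], and edge attributes are carried through unchanged):

import org.apache.spark.graphx._
val canonical = Graph(
  graph.vertices,
  graph.edges.map(e => if (e.srcId < e.dstId) e else Edge(e.dstId, e.srcId, e.attr))
).partitionBy(PartitionStrategy.RandomVertexCut)  // the 1.2 docs also say the graph must have been partitioned with partitionBy
val triangles = canonical.triangleCount().vertices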
Re: GraphX rmatGraph hangs
Thank you. I created https://issues.apache.org/jira/browse/SPARK-5064
- Original Message - From: xhudik xhu...@gmail.com To: dev@spark.apache.org Cc: Sent: Saturday, January 3, 2015 2:04 PM Subject: Re: GraphX rmatGraph hangs
Hi Michael, yes, I can confirm the behavior. It gets stuck (in a loop?) and eats all resources; the top command gives: 14013 ll 20 0 2998136 489992 19804 S 100.2 12.10 13:29.39 java You might create an issue/bug in jira (https://issues.apache.org/jira/browse/SPARK) best, tomas
GraphX rmatGraph hangs
The following single line just hangs, when executed in either Spark Shell or standalone: org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8) It just outputs 0 edges and then locks up. The only other information I've found via Google is: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408617621830-12570.p...@n3.nabble.com%3E
15 new MLlib algorithms
At Spark Summit, Patrick Wendell indicated the number of MLlib algorithms would roughly double in 1.1 from the current approx. 15. http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf What are the planned additional algorithms? In Jira, I only see two when filtering on version 1.1, component MLlib: one on multi-label and another on high dimensionality. https://issues.apache.org/jira/browse/SPARK-2329?jql=issuetype%20in%20(Brainstorming%2C%20Epic%2C%20%22New%20Feature%22%2C%20Story)%20AND%20fixVersion%20%3D%201.1.0%20AND%20component%20%3D%20MLlib http://tinyurl.com/ku7sehu
GraphX triplets on 5-node graph
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I missing something fundamental?
val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5")))
val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E3"), Edge(3L, 5L, "E4")))
Graph(nodes, edges).triplets.collect
res1: Array[org.apache.spark.graphx.EdgeTriplet[String,String]] = Array(((1,N1),(3,N3),E2), ((1,N1),(3,N3),E2), ((3,N3),(5,N5),E4), ((3,N3),(5,N5),E4))
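One thing worth trying (untested sketch): if the triplets iterator is reusing a single EdgeTriplet object under the hood -- which the repeated entries in that output would be consistent with -- then copying the values out before collecting should show all four distinct triplets:

Graph(nodes, edges).triplets.map(t => (t.srcAttr, t.dstAttr, t.attr)).collect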
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
While developers may appreciate 1.0 == API stability, I'm not sure that will be the understanding of the VP who gives the green light to a Spark-based development effort. I fear a bug that silently produces erroneous results will be perceived like the FDIV bug, but in this case without the momentum of an existing large installed base and with a number of competitors (GridGain, H2O, Stratosphere). Despite the stated intention of API stability, the perception (which becomes the reality) of 1.0 is that it's ready for production use -- not bullet-proof, but also not with known silent generation of erroneous results. Exceptions and crashes are much more tolerated than silent corruption of data. The result may be a reputation of the Spark team as unconcerned about data integrity. I ran into (and submitted) https://issues.apache.org/jira/browse/SPARK-1817 due to the lack of zipWithIndex(). zip() with a self-created partitioned range was the way I was trying to assign IDs to a collection of nodes in preparation for the GraphX constructor. For the record, it was a frequent Spark committer who escalated it to blocker; I did not submit it as such. Partitioning a Scala range isn't just a toy example; it has a real-life use. I also wonder about the REPL. Cloudera, for example, touts it as key to making Spark a crossover tool that Data Scientists can also use. The REPL can be considered an API of sorts -- not a traditional Scala or Java API, of course, but the API that a human data analyst would use. With the Scala REPL exhibiting some of the same bad behaviors as the Spark REPL, there is a question of whether the Spark REPL can even be fixed. If the Spark REPL has to be eliminated after 1.0 due to an inability to repair it, that would constitute API instability.
On Saturday, May 17, 2014 2:49 PM, Matei Zaharia matei.zaha...@gmail.com wrote: As others have said, the 1.0 milestone is about API stability, not about saying "we've eliminated all bugs". The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is something that I've seen done badly (and experienced the effects thereof) in other big data projects, such as MapReduce and even YARN. The result is that you annoy users, you end up with a fragmented userbase where everyone is building against a different version, and you drastically slow down development. With a project as fast-growing as Spark in particular, there will be new bugs discovered and reported continuously, especially in the non-core components. Look at the graph of # of contributors in time to Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; "commits" changed when we started merging each patch as a single commit). This is not slowing down, and we need to have the culture now that we treat API stability and release numbers at the level expected for a 1.0 project instead of having people come in and randomly change the API. I'll also note that the issues marked "blocker" were marked so by their reporters, since the reporter can set the priority. I don't consider stuff like parallelize() not partitioning ranges in the same way as other collections a blocker — it's a bug, it would be good to fix it, but it only affects a small number of use cases.
Of course if we find a real blocker (in particular a regression from a previous version, or a feature that's just completely broken), we will delay the release for that, but at some point you have to say "okay, this fix will go into the next maintenance release". Maybe we need to write a clear policy for what the issue priorities mean. Finally, I believe it's much better to have a culture where you can make releases on a regular schedule, and have the option to make a maintenance release in 3-4 days if you find new bugs, than one where you pile up stuff into each release. This is what much larger projects than us, like Linux, do, and it's the only way to avoid indefinite stalling with a large contributor base. In the worst case, if you find a new bug that warrants immediate release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug fix in it). And if you find an API that you'd like to improve, just add a new one and maybe deprecate the old one — at some point we have to respect our users and let them know that code they write today will still run tomorrow. Matei
On May 17, 2014, at 10:32 AM, Kan Zhang kzh...@apache.org wrote: +1 on the running commentary here, non-binding of course :-)
On Sat, May 17, 2014 at 8:44 AM, Andrew Ash and...@andrewash.com wrote: +1 on the next release feeling more like a 0.10 than a 1.0
On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com wrote: I had echoed similar sentiments a while back when there was a discussion around 0.10 vs 1.0 ... I would have
Serializable different behavior Spark Shell vs. Scala Shell
Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute with isInstanceOf[]. I am using Spark 0.9.0/Scala 2.10.3. Is this a bug?
Spark Shell (equals uses match{}) =====
class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = false
Spark Shell (equals uses isInstanceOf[]) =====
class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == s) else false
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = true
Scala Shell (equals uses match{}) =====
class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = true
Re: Serializable different behavior Spark Shell vs. Scala Shell
Thank you for your investigation into this! Just for completeness, I've confirmed it's a problem only in the REPL, not in compiled Spark programs. But within the REPL, a direct consequence of the classes not being the same after serialization/deserialization is that lookup() doesn't work either:
scala> class C(val s:String) extends Serializable {
     | override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
     | override def toString = s
     | }
defined class C
scala> val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
r: org.apache.spark.rdd.RDD[(C, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:14
scala> r.lookup(new C("a"))
<console>:17: error: type mismatch;
 found   : C
 required: C
       r.lookup(new C("a"))
                ^
On Tuesday, May 13, 2014 3:05 PM, Anand Avati av...@gluster.org wrote: Hmm. I see that this can be reproduced without Spark in Scala 2.11, with and without the -Yrepl-class-based command line flag to the repl. Spark's REPL has the effective behavior of 2.11's -Yrepl-class-based flag. Inspecting the byte code generated, it appears -Yrepl-class-based results in the creation of a $outer field in the generated classes (including class C). The first case match in equals() results in code along the lines of (simplified): if (o isinstanceof C && this.$outer == that.$outer) { // do string compare // } $outer is the synthetic field pointing to the outer object in which the object was created (in this case, the repl environment). Now obviously, when x is taken through the bytestream and deserialized, it would have a new $outer created (it may have been deserialized in a different jvm or machine for all we know). So the $outer's mismatching is expected.
However, I'm still trying to understand why the outers need to be the same for the case match.