RE: Java RDD Union
With that said, and given the nature of the iterative algorithms Spark is advertised for, isn't this a bit of an unnecessary restriction? I don't see where the problem is. For instance, it is clear that when aggregating you need operations to be associative because of the way partial results are divided and combined. But since forEach works on an individual item, the same problem doesn't exist. As an example, during a k-means algorithm you have to continually update the cluster assignment per data item, along with perhaps the distance from the centroid. So if you can't update items in place, you have to literally create thousands upon thousands of RDDs. Does Spark have some kind of trick, like reuse behind the scenes - fully persistent data structures or whatever? How can it possibly be efficient for 'iterative' algorithms when it is creating so many RDDs as opposed to one?

From: so...@cloudera.com
Date: Fri, 5 Dec 2014 14:58:37 -0600
Subject: Re: Java RDD Union
To: ronalday...@live.com; user@spark.apache.org

foreach also creates a new RDD, and does not modify an existing RDD. However, in practice, nothing stops you from fiddling with the Java objects inside an RDD when you get a reference to them in a method like this. This is definitely a bad idea, as there is certainly no guarantee that any other operations will see any, some, or all of these edits.
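For reference, the idiomatic Spark expression of this per-item update is a transformation that yields a new RDD each pass, not an in-place edit. A minimal Scala sketch follows; the Doc type, the centroids array, and the distance function are illustrative placeholders, not anything from this thread.

import org.apache.spark.rdd.RDD

case class Doc(id: Long, vector: Array[Double], cluster: Int, dist: Double)

// One assignment pass: build new Doc records rather than mutating old ones.
def assign(docs: RDD[Doc],
           centroids: Array[Array[Double]],
           distance: (Array[Double], Array[Double]) => Double): RDD[Doc] =
  docs.map { d =>
    // Pick the nearest centroid for this document.
    val (bestDist, bestCluster) =
      centroids.zipWithIndex
        .map { case (c, i) => (distance(d.vector, c), i) }
        .min
    d.copy(cluster = bestCluster, dist = bestDist)
  }

Each pass returns a new RDD handle, but that handle is cheap; what it actually costs is taken up in the reply below.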
Re: Java RDD Union
I guess a major problem with this is that you lose fault tolerance. You have no way of recreating the local state of the mutable RDD if a partition is lost.

Why would you need thousands of RDDs for k-means? It's a few per iteration. An RDD is more bookkeeping than data structure. RDDs don't inherently take up resources unless you mark them to be persisted. You're paying the cost of copying objects to create one RDD from the next, but that's mostly it.

On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub <ronalday...@live.com> wrote:

With that said, and given the nature of the iterative algorithms Spark is advertised for, isn't this a bit of an unnecessary restriction? I don't see where the problem is. For instance, it is clear that when aggregating you need operations to be associative because of the way partial results are divided and combined. But since forEach works on an individual item, the same problem doesn't exist. As an example, during a k-means algorithm you have to continually update the cluster assignment per data item, along with perhaps the distance from the centroid. So if you can't update items in place, you have to literally create thousands upon thousands of RDDs. Does Spark have some kind of trick, like reuse behind the scenes - fully persistent data structures or whatever? How can it possibly be efficient for 'iterative' algorithms when it is creating so many RDDs as opposed to one?
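To make that concrete: a driver loop for an iterative algorithm usually keeps only a couple of live RDD handles at once, persisting the current pass and unpersisting the previous one. A rough sketch, reusing the assign function from the earlier sketch; docs, centroids, distance, and numIterations are again placeholders:

var current = assign(docs, centroids, distance).persist()
current.count()                  // materialize the first pass
for (i <- 2 to numIterations) {
  val next = assign(current, centroids, distance).persist()
  next.count()                   // force computation of the new pass
  current.unpersist()            // then release the previous copy
  current = next
}

Many RDDs are created over the run, but at any moment only about one materialized copy of the data is held.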
RE: Java RDD Union
Hierarchical k-means requires a massive number of iterations, whereas flat k-means does not, but I've found flat clustering to be generally useless, since in most UIs it is nice to be able to drill down into more and more specific clusters. If you have 100 million documents and your branching factor is 8 (8-secting k-means), then you will be picking a cluster to split and iterating thousands of times. So per split you iterate maybe 6 or 7 times to get new cluster assignments, and there are ultimately going to be 5,000 to 50,000 splits, depending on the split criterion, cluster variances, etc.

In this case fault tolerance doesn't matter. I've found that the distributed aspect of RDDs is what I'm looking for; I don't particularly care about or need the resilience part. It is a one-off algorithm that can just be run again if something goes wrong, and once the data is created, I'm done with Spark. But anyway, this is the very thing Spark is advertised for.

From: so...@cloudera.com
Date: Sat, 6 Dec 2014 06:39:10 -0600
Subject: Re: Java RDD Union
To: ronalday...@live.com
CC: user@spark.apache.org

I guess a major problem with this is that you lose fault tolerance. You have no way of recreating the local state of the mutable RDD if a partition is lost.

Why would you need thousands of RDDs for k-means? It's a few per iteration. An RDD is more bookkeeping than data structure. RDDs don't inherently take up resources unless you mark them to be persisted. You're paying the cost of copying objects to create one RDD from the next, but that's mostly it.
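A practical caveat when iterations run into the thousands, independent of whether resilience is wanted: every transformation extends the RDD lineage, and a very long lineage chain can itself become a problem for the scheduler. Periodic checkpointing truncates it. A hedged sketch; the checkpoint directory, the interval of 10, and the surrounding names are all illustrative:

sc.setCheckpointDir("hdfs:///tmp/kmeans-ckpt")  // illustrative path
var current = docs
for (i <- 1 to numIterations) {
  current = assign(current, centroids, distance)
  if (i % 10 == 0) {        // arbitrary interval
    current.persist()       // avoid computing the pass twice
    current.checkpoint()    // cut the lineage here
    current.count()         // checkpointing runs on the next action
  }
}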
Re: Java RDD Union
No, RDDs are immutable. union() creates a new RDD, and does not modify an existing RDD. Maybe this obviates the question. I'm not sure what you mean about releasing from memory. If you want to repartition the unioned RDD, you repartition the result of union(), not anything else.

On Fri, Dec 5, 2014 at 1:27 PM, Ron Ayoub <ronalday...@live.com> wrote:

I'm a bit confused regarding the expected behavior of unions. I'm running on 8 cores. I have an RDD that is used to collect cluster associations (cluster id, content id, distance) for internal clusters as well as leaf clusters, since I'm doing hierarchical k-means and need all distances for sorting documents appropriately upon examination. It appears that union simply adds the items in the argument to the RDD instance the method is called on, rather than just returning a new RDD. If I want to use union this way, as more of an add/append, should I be capturing the return value and releasing the original from memory? I need help clarifying the semantics here. Also, in another related thread someone mentioned coalesce after union. Would I need to do the same on the instance RDD I'm calling union on? Perhaps a method such as append would be useful and clearer.
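In code, that advice comes out as the following sketch (the RDD names are illustrative, and the partition count of 8 just mirrors the cores mentioned above). union leaves both inputs untouched, and its result carries the combined partitions of the two inputs, which is why coalesce is applied to the result only:

val combined = assignments.union(newAssignments) // new RDD; inputs unchanged
val compact  = combined.coalesce(8)              // repartition the result,
                                                 // not the original RDDs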
Re: Java RDD Union
Hi Ron,

Out of curiosity, why do you think that union is modifying an existing RDD in place? In general, all transformations, including union, will create new RDDs rather than modify old RDDs in place. Here's a quick test:

scala> val firstRDD = sc.parallelize(1 to 5)
firstRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:12

scala> val secondRDD = sc.parallelize(1 to 3)
secondRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:12

scala> firstRDD.collect()
res1: Array[Int] = Array(1, 2, 3, 4, 5)

scala> secondRDD.collect()
res2: Array[Int] = Array(1, 2, 3)

scala> val newRDD = firstRDD.union(secondRDD)
newRDD: org.apache.spark.rdd.RDD[Int] = UnionRDD[4] at union at <console>:16

scala> newRDD.collect()
res3: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3)

scala> firstRDD.collect()
res4: Array[Int] = Array(1, 2, 3, 4, 5)

scala> secondRDD.collect()
res5: Array[Int] = Array(1, 2, 3)

On Fri, Dec 5, 2014 at 2:27 PM, Ron Ayoub <ronalday...@live.com> wrote:
Re: Java RDD Union
foreach also creates a new RDD, and does not modify an existing RDD. However, in practice, nothing stops you from fiddling with the Java objects inside an RDD when you get a reference to them in a method like this. This is definitely a bad idea, as there is certainly no guarantee that any other operations will see any, some, or all of these edits.

On Fri, Dec 5, 2014 at 2:40 PM, Ron Ayoub <ronalday...@live.com> wrote:

I tricked myself into thinking it was uniting things correctly. I see I'm wrong now. I have a question regarding your comment that RDDs are immutable. Can you change values in an RDD using forEach? Does that violate immutability? I've been using forEach to modify RDDs, but perhaps I've tricked myself once again into believing it is working. I have object references, so perhaps it is working serendipitously in local mode: the references are in fact not changing, but their referents are, and somehow this will no longer work when clustering. Thanks for the comments.
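The two styles side by side, as a Scala sketch; the MutableDoc type, the docs RDD, and the nearest function are illustrative placeholders:

class MutableDoc(var id: Long, var cluster: Int) extends Serializable

// Risky: mutates objects inside partitions. This can appear to work in
// local mode, where driver and executors share one JVM, but on a cluster
// the edits happen in remote executor JVMs and other operations may see
// any, some, or none of them.
docs.foreach(d => d.cluster = nearest(d))

// Safe: express the update as a transformation that yields a new RDD.
val updated = docs.map(d => new MutableDoc(d.id, nearest(d)))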