Re: Re: Re: RDD usage
From: Chieh-Yen Lin
Sent: 2014-03-24 18:13
Subject: RDD usage

Dear all,

I have a question about the usage of RDDs. I implemented a class called AppDataPoint that looks like this:

case class AppDataPoint(input_y: Double, input_x: Array[Double]) extends Serializable {
  var y: Double = input_y
  var x: Array[Double] = input_x
  ..
}

Furthermore, I create the RDD with the following function:

def parsePoint(line: String): AppDataPoint = {
  /* Some related work for parsing */
  ..
}

Assume the RDD is called "points":

val lines = sc.textFile(inputPath, numPartition)
var points = lines.map(parsePoint _).cache()

The question is that I tried to modify the values in this RDD in place:

points.foreach(p => p.y = another_value)

The operation works: the system shows no warning or error message, and the results are correct. I wonder whether modifying an RDD in place is a correct and genuinely workable design. The documentation says that RDDs are immutable; is there any suggestion?

Thanks a lot.

Chieh-Yen Lin

From: hequn cheng
Sent: 2014/3/25 9:35
To: user@spark.apache.org
Subject: Re: RDD usage

points.foreach(p => p.y = another_value) will return a new modified RDD.

From: 林武康
Sent: 2014/3/25 9:45

Hi hequn, a related question: does that mean the memory usage will be doubled? And furthermore, if the compute function of an RDD is not idempotent, the RDD will change while the job is running; is that right?

From: hequn cheng
Sent: 2014/3/25 10:40
To: user@spark.apache.org
Subject: Re: Re: RDD usage

First question:
If you materialize your modified RDD, for example:

points.foreach(p => p.y = another_value)
points.collect()

or

points.foreach(p => p.y = another_value)
points.saveAsTextFile(...)

then the modified RDD will be materialized, and this will not use any worker memory. If you have more transformations after the map(), Spark pipelines all the transformations and builds a DAG. Very little memory is used at this stage, and it is freed soon. Only cache() will persist your RDD in memory for a long time.

Second question:
Once an RDD has been created, it cannot be changed, because RDDs are immutable. You can only create a new RDD from an existing RDD or from the file system.

From: 林武康
Sent: 2014/3/25 11:11

Hi hequn, I dug into the Spark source a bit deeper and got some ideas. First, immutability is a feature of RDDs but not a hard rule; there are ways to change an RDD, for example through a non-idempotent "compute" function, though it is really a bad design to make that function non-idempotent, because of the uncontrollable side effects. I agree with Mark that foreach can modify the elements of an RDD, but we should avoid this, because it will affect all the RDDs generated from the changed RDD and make the whole process inconsistent and unstable.

These are some rough opinions on the immutability of RDDs; a fuller discussion could make it clearer. Any ideas?

From: hequn cheng
Sent: Tue, Mar 25, 2014, 6:51 PM

Hi, I wrote a program to test this. A non-idempotent "compute" function in foreach does change the values of an RDD. It may look a little crazy to do so, since modifying the RDD makes it impossible to keep RDDs fault-tolerant in Spark :)

From: Chieh-Yen Lin

Got it. Thanks for your help!!

Chieh-Yen
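Since the conclusion of the thread is to treat RDDs as immutable, the safe pattern is to derive a new dataset with map() instead of mutating elements through foreach. Below is a minimal sketch of that pattern using a local Vector as a stand-in for the RDD, so it runs without a Spark cluster; the input line format and the value 9.0 are made-up illustrations, not part of the original post:

```scala
// Stand-in sketch: a local Vector plays the role of the RDD. With Spark,
// `points` would be an RDD[AppDataPoint] and `map` would likewise return
// a new RDD, leaving the original dataset and its lineage untouched.
case class AppDataPoint(y: Double, x: Array[Double]) extends Serializable

def parsePoint(line: String): AppDataPoint = {
  // Hypothetical format "y x1 x2 ...", standing in for the real parsing code.
  val fields = line.split(' ').map(_.toDouble)
  AppDataPoint(fields.head, fields.tail)
}

val lines  = Vector("1.0 0.5 0.5", "2.0 1.5 2.5")
val points = lines.map(parsePoint)

// Instead of points.foreach(p => p.y = another_value), derive a NEW dataset:
val anotherValue = 9.0
val updated = points.map(p => p.copy(y = anotherValue))

println(points.map(_.y))   // original values are untouched: Vector(1.0, 2.0)
println(updated.map(_.y))  // new collection carries the update: Vector(9.0, 9.0)
```

Because the update goes through copy() rather than a var, the original collection is never modified, so (in the Spark case) recomputing a lost partition from the lineage reproduces exactly the same values.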
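hequn's test result, that in-place mutation only appears to work while the mutated objects stay materialized, can be imitated without Spark. In this sketch a lazy Scala view stands in for an uncached RDD: each traversal re-runs the element-producing function (playing the role of an RDD's compute), so mutations made during one pass are silently lost. This is an analogy under that stated assumption, not Spark itself:

```scala
// Analogy sketch (no Spark): a lazy view re-runs `parse` on every traversal,
// much like an uncached RDD recomputes its partitions for every action.
class Point(var y: Double)

def parse(line: String): Point = new Point(line.toDouble)

val lines = Vector("1.0", "2.0")

// Uncached analogue: each traversal builds fresh Point objects.
val lazyPoints = lines.view.map(parse)
lazyPoints.foreach(p => p.y = 99.0)           // mutates throwaway objects
val afterLazy = lazyPoints.map(_.y).toVector  // recomputed: mutation is gone

// Cached analogue: materializing first (like cache() plus an action) keeps
// the objects in memory, so the mutation is visible afterwards.
val cachedPoints = lines.map(parse)           // strict: objects built once
cachedPoints.foreach(p => p.y = 99.0)
val afterCached = cachedPoints.map(_.y)       // mutation persists

println(afterLazy)    // Vector(1.0, 2.0)
println(afterCached)  // Vector(99.0, 99.0)
```

The cached case is exactly why the mutation in the original post "works", and the lazy case is why it is fragile: Spark may drop and recompute cached partitions under memory pressure or after a failure, and the recomputed elements come from the compute function, not from the mutated objects.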