Re: 答复: 答复: RDD usage
Got it. Thanks for your help!! Chieh-Yen On Tue, Mar 25, 2014 at 6:51 PM, hequn cheng chenghe...@gmail.com wrote: Hi~I wrote a program to test.The non-idempotent compute function in foreach does change the value of RDD. It may looks a little crazy to do so since modify the RDD will make it impossible to keep RDD fault-tolerant in spark :) 2014-03-25 11:11 GMT+08:00 林武康 vboylin1...@gmail.com: Hi hequn, I dig into the source of spark a bit deeper, and I got some ideas, firstly, immutable is a feather of rdd but not a solid rule, there are ways to change it, for excample, a rdd with non-idempotent compute function, though it is really a bad design to make that function non-idempotent for uncontrollable side-effect. I agree with Mark that foreach can modify the elements of a rdd, but we should avoid this because it will effect all the rdds generate by this changed rdd , make the whole process inconsistent and unstable. Some rough opinions on the immutable feature of rdd, full discuss can make it more clear. Any ideas? -- 发件人: hequn cheng chenghe...@gmail.com 发送时间: 2014/3/25 10:40 收件人: user@spark.apache.org 主题: Re: 答复: RDD usage First question: If you save your modified RDD like this: points.foreach(p=p.y = another_value).collect() or points.foreach(p=p.y = another_value).saveAsTextFile(...) the modified RDD will be materialized and this will not use any work's memory. If you have more transformatins after the map(), the spark will pipelines all transformations and build a DAG. Very little memory will be used in this stage and the memory will be free soon. Only cache() will persist your RDD in memory for a long time. Second question: Once RDD be created, it can not be changed due to the immutable feature.You can only create a new RDD from the existing RDD or from file system. 2014-03-25 9:45 GMT+08:00 林武康 vboylin1...@gmail.com: Hi hequn, a relative question, is that mean the memory usage will doubled? And further more, if the compute function in a rdd is not idempotent, rdd will changed during the job running, is that right? -- 发件人: hequn cheng chenghe...@gmail.com 发送时间: 2014/3/25 9:35 收件人: user@spark.apache.org 主题: Re: RDD usage points.foreach(p=p.y = another_value) will return a new modified RDD. 2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw: Dear all, I have a question about the usage of RDD. I implemented a class called AppDataPoint, it looks like: case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends Serializable { var y : Double = input_y var x : Array[Double] = input_x .. } Furthermore, I created the RDD by the following function. def parsePoint(line: String): AppDataPoint = { /* Some related works for parsing */ .. } Assume the RDD called points: val lines = sc.textFile(inputPath, numPartition) var points = lines.map(parsePoint _).cache() The question is that, I tried to modify the value of this RDD, the operation is: points.foreach(p=p.y = another_value) The operation is workable. There doesn't have any warning or error message showed by the system and the results are right. I wonder that if the modification for RDD is a correct and in fact workable design. The usage web said that the RDD is immutable, is there any suggestion? Thanks a lot. Chieh-Yen Lin
答复: RDD usage
Hi hequn, a relative question, is that mean the memory usage will doubled? And further more, if the compute function in a rdd is not idempotent, rdd will changed during the job running, is that right? -原始邮件- 发件人: hequn cheng chenghe...@gmail.com 发送时间: 2014/3/25 9:35 收件人: user@spark.apache.org user@spark.apache.org 主题: Re: RDD usage points.foreach(p=p.y = another_value) will return a new modified RDD. 2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw: Dear all, I have a question about the usage of RDD. I implemented a class called AppDataPoint, it looks like: case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends Serializable { var y : Double = input_y var x : Array[Double] = input_x .. } Furthermore, I created the RDD by the following function. def parsePoint(line: String): AppDataPoint = { /* Some related works for parsing */ .. } Assume the RDD called points: val lines = sc.textFile(inputPath, numPartition) var points = lines.map(parsePoint _).cache() The question is that, I tried to modify the value of this RDD, the operation is: points.foreach(p=p.y = another_value) The operation is workable. There doesn't have any warning or error message showed by the system and the results are right. I wonder that if the modification for RDD is a correct and in fact workable design. The usage web said that the RDD is immutable, is there any suggestion? Thanks a lot. Chieh-Yen Lin
Re: 答复: RDD usage
First question: If you save your modified RDD like this: points.foreach(p=p.y = another_value).collect() or points.foreach(p=p.y = another_value).saveAsTextFile(...) the modified RDD will be materialized and this will not use any work's memory. If you have more transformatins after the map(), the spark will pipelines all transformations and build a DAG. Very little memory will be used in this stage and the memory will be free soon. Only cache() will persist your RDD in memory for a long time. Second question: Once RDD be created, it can not be changed due to the immutable feature.You can only create a new RDD from the existing RDD or from file system. 2014-03-25 9:45 GMT+08:00 林武康 vboylin1...@gmail.com: Hi hequn, a relative question, is that mean the memory usage will doubled? And further more, if the compute function in a rdd is not idempotent, rdd will changed during the job running, is that right? -- 发件人: hequn cheng chenghe...@gmail.com 发送时间: 2014/3/25 9:35 收件人: user@spark.apache.org 主题: Re: RDD usage points.foreach(p=p.y = another_value) will return a new modified RDD. 2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw: Dear all, I have a question about the usage of RDD. I implemented a class called AppDataPoint, it looks like: case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends Serializable { var y : Double = input_y var x : Array[Double] = input_x .. } Furthermore, I created the RDD by the following function. def parsePoint(line: String): AppDataPoint = { /* Some related works for parsing */ .. } Assume the RDD called points: val lines = sc.textFile(inputPath, numPartition) var points = lines.map(parsePoint _).cache() The question is that, I tried to modify the value of this RDD, the operation is: points.foreach(p=p.y = another_value) The operation is workable. There doesn't have any warning or error message showed by the system and the results are right. I wonder that if the modification for RDD is a correct and in fact workable design. The usage web said that the RDD is immutable, is there any suggestion? Thanks a lot. Chieh-Yen Lin
答复: 答复: RDD usage
Hi hequn, I dig into the source of spark a bit deeper, and I got some ideas, firstly, immutable is a feather of rdd but not a solid rule, there are ways to change it, for excample, a rdd with non-idempotent compute function, though it is really a bad design to make that function non-idempotent for uncontrollable side-effect. I agree with Mark that foreach can modify the elements of a rdd, but we should avoid this because it will effect all the rdds generate by this changed rdd , make the whole process inconsistent and unstable. Some rough opinions on the immutable feature of rdd, full discuss can make it more clear. Any ideas? -原始邮件- 发件人: hequn cheng chenghe...@gmail.com 发送时间: 2014/3/25 10:40 收件人: user@spark.apache.org user@spark.apache.org 主题: Re: 答复: RDD usage First question: If you save your modified RDD like this: points.foreach(p=p.y = another_value).collect() or points.foreach(p=p.y = another_value).saveAsTextFile(...) the modified RDD will be materialized and this will not use any work's memory. If you have more transformatins after the map(), the spark will pipelines all transformations and build a DAG. Very little memory will be used in this stage and the memory will be free soon. Only cache() will persist your RDD in memory for a long time. Second question: Once RDD be created, it can not be changed due to the immutable feature.You can only create a new RDD from the existing RDD or from file system. 2014-03-25 9:45 GMT+08:00 林武康 vboylin1...@gmail.com: Hi hequn, a relative question, is that mean the memory usage will doubled? And further more, if the compute function in a rdd is not idempotent, rdd will changed during the job running, is that right? 发件人: hequn cheng 发送时间: 2014/3/25 9:35 收件人: user@spark.apache.org 主题: Re: RDD usage points.foreach(p=p.y = another_value) will return a new modified RDD. 2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw: Dear all, I have a question about the usage of RDD. I implemented a class called AppDataPoint, it looks like: case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends Serializable { var y : Double = input_y var x : Array[Double] = input_x .. } Furthermore, I created the RDD by the following function. def parsePoint(line: String): AppDataPoint = { /* Some related works for parsing */ .. } Assume the RDD called points: val lines = sc.textFile(inputPath, numPartition) var points = lines.map(parsePoint _).cache() The question is that, I tried to modify the value of this RDD, the operation is: points.foreach(p=p.y = another_value) The operation is workable. There doesn't have any warning or error message showed by the system and the results are right. I wonder that if the modification for RDD is a correct and in fact workable design. The usage web said that the RDD is immutable, is there any suggestion? Thanks a lot. Chieh-Yen Lin