Got it. Thanks for your help!!

Chieh-Yen
On Tue, Mar 25, 2014 at 6:51 PM, hequn cheng <chenghe...@gmail.com> wrote:

> Hi~ I wrote a program to test this. A non-idempotent "compute" function in
> foreach does change the values inside the RDD. It may look a little crazy
> to do so, since modifying the RDD makes it impossible to keep the RDD
> fault-tolerant in Spark :)
>
>
> 2014-03-25 11:11 GMT+08:00 林武康 <vboylin1...@gmail.com>:
>
>> Hi hequn, I dug into the Spark source a bit deeper and got some ideas.
>> Firstly, immutability is a feature of RDDs, but not a hard rule; there
>> are ways to get around it, for example an RDD with a non-idempotent
>> "compute" function, though making that function non-idempotent is really
>> bad design because of its uncontrollable side effects. I agree with Mark
>> that foreach can modify the elements of an RDD, but we should avoid doing
>> so, because it will affect all the RDDs generated from the changed RDD
>> and make the whole process inconsistent and unstable.
>>
>> These are some rough opinions on the immutability of RDDs; a fuller
>> discussion could make it clearer. Any ideas?
>> ------------------------------
>> From: hequn cheng <chenghe...@gmail.com>
>> Sent: 2014/3/25 10:40
>>
>> To: user@spark.apache.org
>> Subject: Re: RDD usage
>>
>> First question:
>> If you materialize your modified RDD like this:
>> points.map { p => p.y = another_value; p }.collect() or
>> points.map { p => p.y = another_value; p }.saveAsTextFile(...)
>> the modified RDD will be materialized, and this will not use any worker's
>> memory.
>> If you have more transformations after the map(), Spark will pipeline all
>> the transformations and build a DAG. Very little memory is used in this
>> stage, and that memory is freed soon.
>> Only cache() will persist your RDD in memory for a long time.
>> Second question:
>> Once an RDD has been created, it cannot be changed, because RDDs are
>> immutable. You can only create a new RDD from an existing RDD or from
>> the file system.
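[Editor's note: the lazy-pipelining and caching behavior hequn describes above can be sketched as follows. This is a minimal sketch, not code from the thread: the `Point` class, the input values, and the local master setting are made up for illustration.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical element type with a mutable field, mirroring the thread's example.
case class Point(var y: Double) extends Serializable

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))

    val points = sc.parallelize(Seq(Point(1.0), Point(2.0), Point(3.0)))

    // Transformations are lazy: this map is only recorded in the DAG,
    // so almost no memory is used at this point.
    val shifted = points.map(p => Point(p.y + 1.0))

    // An action materializes the whole pipeline; the results go to the
    // driver (or to storage), not into long-lived worker memory.
    val collected = shifted.collect()

    // Only an explicit cache()/persist() keeps an RDD in memory across jobs.
    shifted.cache()

    sc.stop()
  }
}
```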
>>
>>
>> 2014-03-25 9:45 GMT+08:00 林武康 <vboylin1...@gmail.com>:
>>
>>> Hi hequn, a related question: does that mean the memory usage will be
>>> doubled? And furthermore, if the compute function of an RDD is not
>>> idempotent, the RDD will change while the job is running; is that right?
>>> ------------------------------
>>> From: hequn cheng <chenghe...@gmail.com>
>>> Sent: 2014/3/25 9:35
>>> To: user@spark.apache.org
>>> Subject: Re: RDD usage
>>>
>>> points.foreach(p => p.y = another_value) will return a new modified RDD.
>>>
>>>
>>> 2014-03-24 18:13 GMT+08:00 Chieh-Yen <r01944...@csie.ntu.edu.tw>:
>>>
>>>> Dear all,
>>>>
>>>> I have a question about the usage of RDDs.
>>>> I implemented a class called AppDataPoint; it looks like:
>>>>
>>>> case class AppDataPoint(input_y: Double, input_x: Array[Double])
>>>>     extends Serializable {
>>>>   var y: Double = input_y
>>>>   var x: Array[Double] = input_x
>>>>   ......
>>>> }
>>>>
>>>> Furthermore, I created the RDD with the following function:
>>>>
>>>> def parsePoint(line: String): AppDataPoint = {
>>>>   /* Some related work for parsing */
>>>>   ......
>>>> }
>>>>
>>>> Assume the RDD is called "points":
>>>>
>>>> val lines = sc.textFile(inputPath, numPartition)
>>>> var points = lines.map(parsePoint _).cache()
>>>>
>>>> The question is that I tried to modify the values in this RDD with:
>>>>
>>>> points.foreach(p => p.y = another_value)
>>>>
>>>> The operation works: the system shows no warning or error message, and
>>>> the results are correct.
>>>> I wonder whether modifying an RDD like this is a correct and genuinely
>>>> workable design.
>>>> The documentation says that RDDs are immutable; is there any suggestion?
>>>>
>>>> Thanks a lot.
>>>>
>>>> Chieh-Yen Lin
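[Editor's note: for reference, the alternative the thread converges on, deriving a new RDD with map() instead of mutating elements with foreach, could look roughly like this for Chieh-Yen's AppDataPoint. A hedged sketch: the body of parsePoint was elided in the original mail, so the parsing logic here is invented, and `inputPath`, `numPartition`, and `another_value` are stand-in values.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class AppDataPoint(input_y: Double, input_x: Array[Double])
    extends Serializable {
  var y: Double = input_y
  var x: Array[Double] = input_x
}

object RddUpdateSketch {
  // Invented parsing logic; the original parsePoint body was not shown.
  def parsePoint(line: String): AppDataPoint = {
    val fields = line.split(',').map(_.toDouble)
    AppDataPoint(fields.head, fields.tail)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-update-sketch").setMaster("local[*]"))

    val inputPath = "/tmp/points.txt" // hypothetical path
    val numPartition = 4
    val another_value = 0.0

    val lines = sc.textFile(inputPath, numPartition)
    val points = lines.map(parsePoint _).cache()

    // Discouraged: mutates the cached copies in place, bypassing the lineage,
    // so a recomputed partition would silently lose the change.
    // points.foreach(p => p.y = another_value)

    // Preferred: build a new RDD; `points` stays intact, and the update can
    // always be replayed from the lineage if a partition is lost.
    val updatedPoints = points.map(p => AppDataPoint(another_value, p.x))

    updatedPoints.count() // any action materializes the new RDD
    sc.stop()
  }
}
```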