Hi~ I wrote a program to test this. A non-idempotent "compute" function in
foreach does change the values of the RDD. It may look a little crazy to do
so, since modifying the RDD makes it impossible to keep the RDD
fault-tolerant in Spark :)
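
Roughly, the test looks like the sketch below (the Point class, the local
master and the MEMORY_ONLY cache here are illustrative assumptions, not the
exact program):

import org.apache.spark.{SparkConf, SparkContext}

case class Point(var y: Double)

object ForeachMutationTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("foreach-mutation"))

    // cache() keeps deserialized objects, so later actions reuse the
    // same instances instead of re-parsing the input
    val points = sc.parallelize(Seq(Point(1.0), Point(2.0))).cache()
    points.count()                    // materialize the cached partitions

    points.foreach(p => p.y += 1.0)   // non-idempotent side effect on the cached objects

    // the "immutable" RDD now yields the mutated values (expected: 2.0, 3.0)
    println(points.map(_.y).collect().mkString(", "))

    sc.stop()
  }
}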



2014-03-25 11:11 GMT+08:00 林武康 <vboylin1...@gmail.com>:

>  Hi hequn, I dug into the Spark source a bit deeper and got some ideas.
> Firstly, immutability is a feature of RDDs but not a hard rule; there are
> ways to change an RDD, for example an RDD with a non-idempotent "compute"
> function, though making that function non-idempotent is really bad design
> because of its uncontrollable side effects. I agree with Mark that foreach
> can modify the elements of an RDD, but we should avoid this, because it
> will affect all the RDDs generated from the changed RDD and make the whole
> process inconsistent and unstable, as in the sketch below.
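>
> A small illustration of that instability (this sketch assumes the cached
> points RDD and another_value from the question quoted below):
>
> val shifted = points.map(p => p.y + 1.0)   // derived from the cached points
> shifted.collect()                          // computed from current cached values
> points.foreach(p => p.y = another_value)   // mutate the cached elements in place
> shifted.collect()                          // may now differ for the same RDD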
>
> These are some rough opinions on the immutability of RDDs; a fuller
> discussion can make it clearer. Any ideas?
>  ------------------------------
> From: hequn cheng <chenghe...@gmail.com>
> Sent: 2014/3/25 10:40
> To: user@spark.apache.org
> Subject: Re: Reply: RDD usage
>
>  First question:
> If you build and save your modified RDD like this:
> points.map(p => { p.y = another_value; p }).collect() or
> points.map(p => { p.y = another_value; p }).saveAsTextFile(...)
> the modified RDD will be materialized, and this will not use any worker's
> memory.
> If you have more transformations after the map(), Spark will pipeline
> all the transformations and build a DAG. Very little memory is used in
> this stage and that memory is freed soon.
> Only cache() will persist your RDD in memory for a long time.
> Second question:
> Once an RDD is created, it cannot be changed, because RDDs are immutable.
> You can only create a new RDD from an existing RDD or from the file
> system.
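>
> For example, a minimal sketch (reusing the points / AppDataPoint /
> another_value definitions from the question quoted below; the output path
> is just a placeholder):
>
> // derive new RDDs instead of mutating elements; points itself stays untouched
> val updated = points.map(p => AppDataPoint(another_value, p.x))  // lazy
> val scaled  = updated.map(p => AppDataPoint(p.y * 2.0, p.x))     // lazy, pipelined
> scaled.saveAsTextFile("/tmp/points_out")  // only this action triggers execution
> // scaled.cache()  // opt in only if you need the result to stay in memory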
>
>
> 2014-03-25 9:45 GMT+08:00 林武康 <vboylin1...@gmail.com>:
>
>>  Hi hequn, a related question: does that mean the memory usage will be
>> doubled? And furthermore, if the compute function of an RDD is not
>> idempotent, the RDD will change while the job is running, is that right?
>>  ------------------------------
>> From: hequn cheng <chenghe...@gmail.com>
>> Sent: 2014/3/25 9:35
>> To: user@spark.apache.org
>> Subject: Re: RDD usage
>>
>>  points.foreach(p=>p.y = another_value) will return a new modified RDD.
>>
>>
>> 2014-03-24 18:13 GMT+08:00 Chieh-Yen <r01944...@csie.ntu.edu.tw>:
>>
>>>  Dear all,
>>>
>>> I have a question about the usage of RDD.
>>> I implemented a class called AppDataPoint; it looks like this:
>>>
>>> case class AppDataPoint(input_y : Double, input_x : Array[Double])
>>> extends Serializable {
>>>   var y : Double = input_y
>>>   var x : Array[Double] = input_x
>>>   ......
>>> }
>>> Furthermore, I create the RDD using the following parsing function.
>>>
>>> def parsePoint(line: String): AppDataPoint = {
>>>   /* Some related works for parsing */
>>>   ......
>>> }
>>>
>>> Assume the RDD is called "points":
>>>
>>> val lines = sc.textFile(inputPath, numPartition)
>>> var points = lines.map(parsePoint _).cache()
>>>
>>> The question is that I tried to modify the values of this RDD with the
>>> following operation:
>>>
>>> points.foreach(p=>p.y = another_value)
>>>
>>> The operation works: the system shows no warning or error message, and
>>> the results are correct.
>>> I wonder whether modifying an RDD like this is a correct and genuinely
>>> workable design.
>>> The documentation on the web says that RDDs are immutable; is there any
>>> suggestion?
>>>
>>> Thanks a lot.
>>>
>>> Chieh-Yen Lin
>>>
>>
>>
>
