Re: Re: Re: RDD usage

2014-03-29 Thread Chieh-Yen
Got it.
Thanks for your help!!

Chieh-Yen


On Tue, Mar 25, 2014 at 6:51 PM, hequn cheng wrote:

> Hi~ I wrote a program to test this. The non-idempotent "compute" function
> in foreach does change the values inside the RDD. It may look a little
> crazy to do so, since modifying the RDD makes it impossible to keep the
> RDD fault-tolerant in Spark :)
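>
> A minimal sketch of the kind of test I mean (hypothetical code; it assumes
> the AppDataPoint class from the original mail and the default MEMORY_ONLY
> storage level, where cached objects are kept deserialized):
>
> val points = sc.parallelize(Seq(AppDataPoint(1.0, Array(1.0)))).cache()
> points.count()                    // materialize the cache
> points.foreach(p => p.y = 99.0)   // mutate the cached objects in place
> println(points.first().y)         // prints 99.0, not 1.0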
>
>
>
> 2014-03-25 11:11 GMT+08:00 林武康 :
>
>>  Hi hequn, I dug into the source of Spark a bit deeper, and I got some
>> ideas. Firstly, immutability is a feature of RDDs but not a solid rule;
>> there are ways to change it, for example an RDD with a non-idempotent
>> "compute" function, though making that function non-idempotent is really
>> a bad design because of its uncontrollable side effects. I agree with
>> Mark that foreach can modify the elements of an RDD, but we should avoid
>> this because it will affect all the RDDs generated from the changed RDD
>> and make the whole process inconsistent and unstable.
>>
>> These are some rough opinions on the immutability of RDDs; a full
>> discussion can make them clearer. Any ideas?
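>>
>> A rough sketch of the inconsistency I mean (hypothetical code, reusing
>> lines and parsePoint from the original mail):
>>
>> val points  = lines.map(parsePoint _).cache()
>> val shifted = points.map(p => p.y + 1.0)  // an RDD generated from points
>> points.count()                            // materialize the cache
>> points.foreach(p => p.y = 0.0)            // mutate the cached elements
>> shifted.collect()                         // recomputed from the mutated
>>                                           // values: every downstream RDD
>>                                           // now sees the changed data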
>>  --
>> From: hequn cheng
>> Sent: 2014/3/25 10:40
>>
>> To: user@spark.apache.org
>> Subject: Re: Re: RDD usage
>>
>>  First question:
>> If you materialize your modified RDD like this:
>> points.map(p => { p.y = another_value; p }).collect() or
>> points.map(p => { p.y = another_value; p }).saveAsTextFile(...)
>> the modified RDD will be materialized, and this will not use any worker's
>> memory.
>> If you have more transformations after the map(), Spark will pipeline all
>> the transformations and build a DAG. Very little memory will be used in
>> this stage, and the memory will be freed soon.
>> Only cache() will persist your RDD in memory for a long time.
>> Second question:
>> Once an RDD is created, it cannot be changed, because RDDs are immutable.
>> You can only create a new RDD from an existing RDD or from the file
>> system.
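>>
>> For completeness, a sketch of the immutable way to get the same effect
>> (hypothetical code: it builds a new RDD instead of mutating points, and
>> the output path is a placeholder):
>>
>> val updated = points.map(p => AppDataPoint(another_value, p.x))
>> updated.saveAsTextFile("hdfs:///tmp/updated-points")  // placeholder path
>> // points itself is untouched, so lineage and fault tolerance still hold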
>>
>>
>> 2014-03-25 9:45 GMT+08:00 林武康 :
>>
>>>  Hi hequn, a related question: does that mean the memory usage will be
>>> doubled? And furthermore, if the compute function of an RDD is not
>>> idempotent, the RDD will change while the job is running, is that right?
>>>  --
>>> From: hequn cheng
>>> Sent: 2014/3/25 9:35
>>> To: user@spark.apache.org
>>> Subject: Re: RDD usage
>>>
>>>  points.foreach(p => p.y = another_value) will return a new modified RDD.
>>>
>>>
>>> 2014-03-24 18:13 GMT+08:00 Chieh-Yen :
>>>
  Dear all,

 I have a question about the usage of RDDs.
 I implemented a class called AppDataPoint; it looks like:

 case class AppDataPoint(input_y: Double, input_x: Array[Double])
     extends Serializable {
   var y: Double = input_y
   var x: Array[Double] = input_x
   ..
 }
 Furthermore, I created the RDD by the following function.

 def parsePoint(line: String): AppDataPoint = {
   /* Some related works for parsing */
   ..
 }

 Assume the RDD is called "points":

 val lines = sc.textFile(inputPath, numPartition)
 var points = lines.map(parsePoint _).cache()

 The question is: I tried to modify the values in this RDD with the
 operation:

 points.foreach(p => p.y = another_value)

 The operation works.
 The system shows no warning or error message, and the results are right.
 I wonder whether modifying an RDD in place like this is a correct and in
 fact workable design.
 The documentation says that RDDs are immutable; is there any suggestion?

 Thanks a lot.

 Chieh-Yen Lin

>>>
>>>
>>
>

