Re: Re: Re: RDD usage

2014-03-29 Thread Chieh-Yen
Got it.
Thanks for your help!!

Chieh-Yen


On Tue, Mar 25, 2014 at 6:51 PM, hequn cheng chenghe...@gmail.com wrote:

 Hi, I wrote a program to test this. A non-idempotent compute function in
 foreach does change the values inside the RDD. It may look a little crazy to
 do so, since modifying the RDD makes it impossible to keep the RDD
 fault-tolerant in Spark :)
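
 The kind of check described above can be sketched roughly as follows. This
 is a hypothetical snippet, not the actual test program: it assumes a running
 SparkContext named sc and a simplified Point class standing in for the
 AppDataPoint class from later in this thread.

 ```scala
 // Sketch only: assumes a live SparkContext `sc`; `Point` is a stand-in
 // for the AppDataPoint class discussed in this thread.
 case class Point(var y: Double, x: Array[Double]) extends Serializable

 val points = sc.parallelize(Seq(Point(1.0, Array(2.0)))).cache()

 // Mutating elements through foreach "works" on the cached blocks...
 points.foreach(p => p.y = 99.0)
 println(points.first().y)  // may show 99.0 while the cached copy survives

 // ...but if a partition is evicted and recomputed from the lineage,
 // parsePoint-style recomputation restores the original values and the
 // mutation is silently lost. That is the fault-tolerance problem above.
 ```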



 2014-03-25 11:11 GMT+08:00 林武康 vboylin1...@gmail.com:

  Hi hequn, I dug into the Spark source a bit deeper and got some ideas.
 Firstly, immutability is a feature of RDDs but not a hard rule; there are
 ways to change an RDD, for example one with a non-idempotent compute
 function, though it is really a bad design to make that function
 non-idempotent, because of its uncontrollable side effects. I agree with
 Mark that foreach can modify the elements of an RDD, but we should avoid
 this, because it will affect all the RDDs generated from the changed RDD and
 make the whole process inconsistent and unstable.

 These are some rough opinions on the immutability of RDDs; a fuller
 discussion could make things clearer. Any ideas?
  --
 From: hequn cheng chenghe...@gmail.com
 Sent: 2014/3/25 10:40

 To: user@spark.apache.org
 Subject: Re: Re: RDD usage

  First question:
 If you save your modified RDD like this:
 points.foreach(p => p.y = another_value).collect() or
 points.foreach(p => p.y = another_value).saveAsTextFile(...)
 the modified RDD will be materialized, and this will not use any worker's
 memory.
 If you have more transformations after the map(), Spark will pipeline
 all the transformations and build a DAG. Very little memory will be used in
 this stage, and that memory will be freed soon.
 Only cache() will persist your RDD in memory for a long time.
 Second question:
 Once an RDD is created, it cannot be changed, because of its immutability.
 You can only create a new RDD from an existing RDD or from the file
 system.
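
 The immutable-friendly way to get the same effect is to derive a new RDD
 with map instead of mutating elements in place. A minimal sketch, assuming
 the points RDD and AppDataPoint class from the original question below
 (another_value and outputPath are placeholders, not real variables):

 ```scala
 // Sketch: build a new RDD instead of mutating the cached one in place.
 // AppDataPoint(input_y, input_x) is the case class from the original mail.
 val updated = points.map(p => AppDataPoint(another_value, p.x))

 updated.cache()                  // cache the derived RDD if it is reused
 updated.saveAsTextFile(outputPath) // or materialize it with an action
 ```

 Because updated is produced by a deterministic transformation, a lost
 partition can always be recomputed from the lineage, which is exactly what
 in-place mutation breaks.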


 2014-03-25 9:45 GMT+08:00 林武康 vboylin1...@gmail.com:

  Hi hequn, a related question: does that mean the memory usage will be
 doubled? And furthermore, if the compute function of an RDD is not
 idempotent, the RDD will change while the job is running, is that right?
  --
 From: hequn cheng chenghe...@gmail.com
 Sent: 2014/3/25 9:35
 To: user@spark.apache.org
 Subject: Re: RDD usage

  points.foreach(p => p.y = another_value) will return a new modified RDD.


 2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw:

  Dear all,

 I have a question about the usage of RDDs.
 I implemented a class called AppDataPoint; it looks like:

 case class AppDataPoint(input_y : Double, input_x : Array[Double])
 extends Serializable {
   var y : Double = input_y
   var x : Array[Double] = input_x
   ..
 }
 Furthermore, I created the RDD with the following function.

 def parsePoint(line: String): AppDataPoint = {
   /* Some related works for parsing */
   ..
 }

 Assume the RDD is called points:

 val lines = sc.textFile(inputPath, numPartition)
 var points = lines.map(parsePoint _).cache()

 The question is that I tried to modify the values in this RDD; the
 operation is:

 points.foreach(p => p.y = another_value)

 The operation works.
 The system shows no warning or error message,
 and the results are correct.
 I wonder whether modifying an RDD like this is a correct and in fact
 workable design.
 The usage web page says that RDDs are immutable; is there any suggestion?

 Thanks a lot.

 Chieh-Yen Lin







Re: RDD usage

2014-03-24 Thread 林武康
Hi hequn, a related question: does that mean the memory usage will be
doubled? And furthermore, if the compute function of an RDD is not
idempotent, the RDD will change while the job is running, is that right?


Re: Re: RDD usage

2014-03-24 Thread hequn cheng
First question:
If you save your modified RDD like this:
points.foreach(p => p.y = another_value).collect() or
points.foreach(p => p.y = another_value).saveAsTextFile(...)
the modified RDD will be materialized, and this will not use any worker's
memory.
If you have more transformations after the map(), Spark will pipeline all
the transformations and build a DAG. Very little memory will be used in
this stage, and that memory will be freed soon.
Only cache() will persist your RDD in memory for a long time.
Second question:
Once an RDD is created, it cannot be changed, because of its immutability.
You can only create a new RDD from an existing RDD or from the file system.



Re: Re: RDD usage

2014-03-24 Thread 林武康
Hi hequn, I dug into the Spark source a bit deeper and got some ideas.
Firstly, immutability is a feature of RDDs but not a hard rule; there are
ways to change an RDD, for example one with a non-idempotent compute
function, though it is really a bad design to make that function
non-idempotent, because of its uncontrollable side effects. I agree with Mark
that foreach can modify the elements of an RDD, but we should avoid this,
because it will affect all the RDDs generated from the changed RDD and make
the whole process inconsistent and unstable.

These are some rough opinions on the immutability of RDDs; a fuller
discussion could make things clearer. Any ideas?
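
One way to picture the hazard of a non-idempotent compute function: if a
transformation closes over mutable external state, recomputing a lost
partition yields different data than the first run. A hypothetical sketch
(rdd stands for any existing RDD; this is not code from the thread):

```scala
// Sketch: a non-idempotent transformation. Each evaluation of the closure
// can produce different results, so a recomputed partition will not match
// the one it replaces.
var counter = 0
val tainted = rdd.map { x =>
  counter += 1   // side effect on captured state (also misleading: the
  x + counter    // closure is serialized separately into each task)
}
// If a partition of `tainted` is lost and recomputed from the lineage,
// the new values can differ from the first run. Lineage-based recovery
// assumes deterministic, idempotent compute functions.
```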
