Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Hi,

We have a requirement to save RDD output to HDFS with a saveAsTextFile-like 
API, but we need to overwrite the data if it already exists. I'm not sure 
whether current Spark supports this kind of operation, or whether I need to 
check for existing output manually.

There's a thread on the mailing list that discussed this 
(http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
but I'm not sure whether this feature is available, or whether it requires 
some configuration.

Appreciate your suggestions.

Thanks a lot
Jerry
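
A minimal sketch of the manual workaround mentioned above, assuming a Spark 
1.x SparkContext: delete any existing output with the Hadoop FileSystem API 
before calling saveAsTextFile. The application name and output path below are 
illustrative.

import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("overwrite-example"))

// Illustrative output location; replace with the real target directory.
val output = "hdfs:///tmp/overwrite-example/output"

// Remove any previous output so saveAsTextFile's existence check passes.
val outPath = new Path(output)
val fs = outPath.getFileSystem(sc.hadoopConfiguration)
if (fs.exists(outPath)) {
  fs.delete(outPath, true) // recursive delete
}

sc.parallelize(1 to 100).saveAsTextFile(output)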


RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Cheng, Hao
I am wondering if we can provide a friendlier API, rather than a configuration 
option, for this purpose. What do you think, Patrick?

Cheng Hao

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Thursday, December 25, 2014 3:22 PM
To: Shao, Saisai
Cc: u...@spark.apache.org; dev@spark.apache.org
Subject: Re: Question on saveAsTextFile with overwrite option

Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

http://spark.apache.org/docs/latest/configuration.html

- Patrick
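
A minimal sketch of that approach, assuming the property is set through 
SparkConf; the application name and output path are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

// Skip the check that makes saveAsTextFile fail when the output
// directory already exists. This does not remove files already there.
val conf = new SparkConf()
  .setAppName("overwrite-via-config")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)

sc.parallelize(Seq("a", "b", "c"))
  .saveAsTextFile("hdfs:///tmp/overwrite-example/output")

The same property can also be passed on the command line, e.g. 
spark-submit --conf spark.hadoop.validateOutputSpecs=false. Note that this 
only skips the existence check; it does not clear files already in the 
directory.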



Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
So the behavior of overwriting existing directories is, IMO, something we 
don't want to encourage. The reason the Hadoop client has these checks is 
that it's very easy for users to do unsafe things without them. For instance, 
a user could overwrite an RDD that had 100 partitions with an RDD that has 10 
partitions... and if they read the output back, they would get a corrupted 
RDD containing a combination of data from the old and new RDDs.

If users want to circumvent these safety checks, we need to make them disable 
the checks explicitly. Given that, I think a config option is as reasonable 
as any alternative, and it's already pretty easy IMO.

- Patrick
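
A small local sketch of the failure mode described above, assuming the 
output-spec check has been disabled; the paths and partition counts are 
illustrative. After the second write, only part-00000 and part-00001 are 
replaced, while part-00002 and part-00003 from the first write remain, so 
reading the directory back mixes old and new records.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf()
    .setMaster("local[*]")
    .setAppName("stale-parts-demo")
    .set("spark.hadoop.validateOutputSpecs", "false"))

val out = "file:///tmp/stale-parts-demo"

// First write: 4 partitions -> part-00000 .. part-00003
sc.parallelize(1 to 8, 4).map(i => s"old-$i").saveAsTextFile(out)

// Second write into the same directory: 2 partitions -> only
// part-00000 and part-00001 are rewritten; part-00002 and
// part-00003 still hold data from the first write.
sc.parallelize(1 to 4, 2).map(i => s"new-$i").saveAsTextFile(out)

// Reading the directory back now mixes "old-*" and "new-*" lines.
sc.textFile(out).collect().foreach(println)

sc.stop()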


RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Thanks Patrick for your detailed explanation.

BR
Jerry
