Question on saveAsTextFile with overwrite option
Hi,

We need to save RDD output to HDFS with a saveAsTextFile-like API, but overwrite the data if it already exists. I'm not sure whether current Spark supports this kind of operation, or whether I need to check for existing output manually.

There's a thread on the mailing list that discussed this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), but I'm not sure whether the feature is enabled, or whether it requires some configuration.

I'd appreciate your suggestions.

Thanks a lot,
Jerry
RE: Question on saveAsTextFile with overwrite option
I am wondering if we can provide a more friendly API for this purpose, rather than a configuration option. What do you think, Patrick?

Cheng Hao

-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Thursday, December 25, 2014 3:22 PM
To: Shao, Saisai
Cc: u...@spark.apache.org; dev@spark.apache.org
Subject: Re: Question on saveAsTextFile with overwrite option

Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

http://spark.apache.org/docs/latest/configuration.html

- Patrick
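For reference, spark.hadoop.validateOutputSpecs is an ordinary Spark property, so it can be set wherever other properties are set. A minimal sketch, assuming it is placed in conf/spark-defaults.conf (the choice of file is an assumption; the thread only names the property):

```properties
# conf/spark-defaults.conf
# Assumption: set for every job that uses this file; the thread only names the property.
# Turns off the check that makes a save fail when the output directory already exists.
spark.hadoop.validateOutputSpecs   false
```

The same property can also be passed for a single job with spark-submit --conf spark.hadoop.validateOutputSpecs=false. Turning the check off does not delete files that are already in the output directory.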
Re: Question on saveAsTextFile with overwrite option
So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that has 10 partitions... and if they read back the RDD they would get a corrupted RDD that has a combination of data from the old and new RDD.

If users want to circumvent these safety checks, we need to make them explicitly disable them. Given this, I think a config option is as reasonable as any alternatives. This is already pretty easy IMO.

- Patrick
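The 100-partition versus 10-partition hazard can be reproduced without Spark. The sketch below is a simplified model that uses only local files: it assumes each partition is written as one part-NNNNN file and that reading a directory means reading every part-* file in it. The helper names (write_partitions, read_all, demo) are hypothetical, not Spark or Hadoop APIs.

```python
import os
import shutil
import tempfile


def write_partitions(out_dir, records, num_partitions):
    """Write records round-robin into part-NNNNN files, one file per partition.

    A file with the same name is overwritten, but any other file already in
    out_dir is left in place. This models a save with the output check off.
    """
    os.makedirs(out_dir, exist_ok=True)
    for p in range(num_partitions):
        path = os.path.join(out_dir, "part-%05d" % p)
        with open(path, "w") as f:
            for rec in records[p::num_partitions]:
                f.write(rec + "\n")


def read_all(out_dir):
    """Read every part-* file in the directory and return all lines."""
    lines = []
    for name in sorted(os.listdir(out_dir)):
        if name.startswith("part-"):
            with open(os.path.join(out_dir, name)) as f:
                lines.extend(line.rstrip("\n") for line in f)
    return lines


def demo():
    """Return (mixed, clean): an unchecked overwrite vs. delete-then-write."""
    base = tempfile.mkdtemp()
    out_dir = os.path.join(base, "output")
    old = ["old-%d" % i for i in range(200)]
    new = ["new-%d" % i for i in range(20)]
    try:
        # Unchecked overwrite: 100 old partitions, then 10 new ones on top.
        write_partitions(out_dir, old, 100)
        write_partitions(out_dir, new, 10)
        mixed = read_all(out_dir)

        # Safer pattern: remove the whole directory, then write.
        shutil.rmtree(out_dir)
        write_partitions(out_dir, new, 10)
        clean = read_all(out_dir)
    finally:
        shutil.rmtree(base, ignore_errors=True)
    return mixed, clean


if __name__ == "__main__":
    mixed, clean = demo()
    stale = sum(1 for r in mixed if r.startswith("old-"))
    print("unchecked overwrite:", len(mixed), "records,", stale, "stale")
    print("delete then write:  ", len(clean), "records")
```

In this model the unchecked overwrite replaces only part-00000 through part-00009, so the other 90 old files are still read back alongside the new data. Deleting the directory before writing avoids the mix, which is the manual check the original question asks about.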
RE: Question on saveAsTextFile with overwrite option
Thanks, Patrick, for your detailed explanation.

BR
Jerry