This is not very convenient...
Thanks.
-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Tuesday, June 03, 2014 11:40 AM
To: user@spark.apache.org
Subject: Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing
file
(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's output
format check and overwrite files in the destination directory
(TempFile)
4. delete FileA
5. rename TempFile to FileA
(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's output
format check and overwrite files in the destination directory.
But it won't clobber the directory entirely
Ah, the output directory check was just not executed in the past. I
thought it deleted the files. A third way indeed.
FWIW I also think (B) is best. (A) and (C) both have their risks, but
if they're non-default and everyone's willing to entertain a new arg
to the API method, sure. (A) seems more
Hi,
Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to
throw org.apache.hadoop.mapred.FileAlreadyExistsException when the file already
exists.
Is there a way I can allow Spark to overwrite the existing file?
Cheers,
Kexin
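Until such a flag exists, the usual workaround is to delete the destination yourself before saving. Below is a minimal local-filesystem sketch of that pattern; the `overwrite_save` helper is hypothetical, and a real Spark job would instead delete the path through Hadoop's FileSystem API and then call `rdd.saveAsTextFile`:

```python
import os
import shutil

def overwrite_save(lines, out_dir):
    """Sketch of the delete-first workaround: Spark 1.0 refuses to
    write into an existing directory, so remove it before saving."""
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)  # caller beware: recursive delete
    os.makedirs(out_dir)
    # stand-in for rdd.saveAsTextFile(out_dir)
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.write("\n".join(lines) + "\n")
```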
+1 Same question here...
Message sent from a mobile device - excuse typos and abbreviations
On 2 June 2014, at 10:08, Kexin Xie kexin@bigcommerce.com wrote:
Hi,
Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw
The function saveAsTextFile
https://github.com/apache/spark/blob/7d9cc9214bd06495f6838e355331dd2b5f1f7407/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1066
is
a wrapper around saveAsHadoopFile
Hi Michaël,
Thanks for this. We could indeed do that.
But I guess the question is more about the change of behaviour from 0.9.1 to
1.0.0.
We never had to care about that in previous versions.
Does that mean we have to manually remove existing files or is there a way
to automatically overwrite
Indeed, the behavior has changed for good or for bad. I mean, I agree with the
danger you mention but I'm not sure it's happening like that. Isn't there a
mechanism for overwrite in Hadoop that automatically removes part files, then
writes a _temporary folder and then only the part files along
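The mechanism described here can be sketched as: tasks stage their output under a `_temporary` directory, and finished files are then renamed into the destination. This is only a local-filesystem illustration of the idea; the hypothetical `commit_style_write` is not Hadoop's actual committer:

```python
import os
import shutil

def commit_style_write(out_dir, parts):
    """Illustration of the commit idea: stage each part file under
    _temporary/, then rename it into the destination when finished."""
    tmp = os.path.join(out_dir, "_temporary")
    os.makedirs(tmp, exist_ok=True)
    for name, text in parts.items():
        staged = os.path.join(tmp, name)
        with open(staged, "w") as f:
            f.write(text)
        # "commit" the task output: promote the staged file
        os.replace(staged, os.path.join(out_dir, name))
    shutil.rmtree(tmp)  # clean up the staging directory
```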
Hey There,
The issue was that the old behavior could cause users to silently
overwrite data, which is pretty bad, so to be conservative we decided
to enforce the same checks that Hadoop does.
This was documented by this JIRA:
https://issues.apache.org/jira/browse/SPARK-1100
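The check being enforced is essentially Hadoop's output-spec precondition: fail fast if the output directory already exists. A rough Python simulation of that behaviour (the class and function names here are illustrative stand-ins, not Spark's or Hadoop's actual code):

```python
import os

class FileAlreadyExistsException(Exception):
    """Stand-in for org.apache.hadoop.mapred.FileAlreadyExistsException."""

def check_output_specs(out_dir):
    # The precondition Hadoop's FileOutputFormat enforces, and which
    # Spark 1.0 now runs before writing: the destination must not exist.
    if os.path.exists(out_dir):
        raise FileAlreadyExistsException(
            "Output directory %s already exists" % out_dir)
```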
Hi, Patrick,
I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about the
same thing?
How about assigning it to me?
I think I missed the configuration part in my previous commit, though I
declared that in the PR description….
Best,
--
Nan Zhu
On Monday, June 2,
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
I accidentally assigned myself way back when I created it). This
should be an easy fix.
On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, Patrick,
I think
+1 please re-add this feature
On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com wrote:
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
I accidentally assigned myself way back when I created it). This
should be an easy fix.
On Mon, Jun 2, 2014 at
So in summary:
- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
- There is an open JIRA issue to add an option to allow clobbering.
- Even when clobbering, part- files may be left over from previous
saves, which is dangerous.
Is this correct?
On Mon, Jun 2,
Yes.
On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
So in summary:
- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by
default.
- There is an open JIRA issue to add an option to allow clobbering.
- Even when clobbering, part-
OK, thanks for confirming. Is there something we can do about that leftover
part- files problem in Spark, or is that for the Hadoop team?
On Monday, June 2, 2014, Aaron Davidson ilike...@gmail.com wrote:
Yes.
On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I'm a bit confused because the PR mentioned by Patrick seems to address all
these issues:
https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
Was it not accepted? Or is the description of this PR not completely
implemented?
Message sent from a mobile device - excuse
I assume the idea is for Spark to rm -r dir/, which would clean out
everything that was there before. It's just doing this instead of the
caller. Hadoop still won't let you write into a location that already
exists regardless, and part of that is for this reason that you might
end up with files
Fair enough. That rationale makes sense.
I would prefer that a Spark clobber option also delete the destination
files, but as long as it's a non-default option I can see the caller
beware side of that argument as well.
Nick
On Monday, June 2, 2014, Sean Owen so...@cloudera.com wrote:
I assume the
I made the PR; the problem is that after many rounds of review, the
configuration part was missed… sorry about that
I will fix it
Best,
--
Nan Zhu
On Monday, June 2, 2014 at 5:13 PM, Pierre Borckmans wrote:
I'm a bit confused because the PR mentioned by Patrick seems to address all
We can just add back a flag to make it backwards compatible - it was
just missed during the original PR.
Adding a *third* set of clobber semantics, I'm slightly -1 on that
for the following reasons:
1. It's scary to have Spark recursively deleting user files, could
easily lead to users deleting
Is there a third way? Unless I miss something. Hadoop's OutputFormat
wants the target dir to not exist no matter what, so it's just a
question of whether Spark deletes it for you or errors.
On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell pwend...@gmail.com wrote:
We can just add back a flag to
(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's
output format check and overwrite files in the destination directory.
But it won't clobber the directory entirely. I.e. if the directory
already had part1 part2 part3 part4 and you write a new job
outputting only two files (part1,
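The danger with semantics (A) can be shown concretely: overwriting files in place without clearing the directory leaves stale part files behind from an earlier, larger job. A small local-filesystem simulation (with simplified part-file names):

```python
import os

def write_parts_in_place(out_dir, parts):
    # Option (A) behaviour: overwrite matching files but never clear
    # the directory, so extra files from an earlier job survive.
    os.makedirs(out_dir, exist_ok=True)
    for name, text in parts.items():
        with open(os.path.join(out_dir, name), "w") as f:
            f.write(text)

out = "gr_demo_out4"
write_parts_in_place(out, {"part-%d" % i: "old\n" for i in range(1, 5)})
write_parts_in_place(out, {"part-1": "new\n", "part-2": "new\n"})
# part-3 and part-4 still hold data from the first, four-file job
leftover = sorted(os.listdir(out))
```

A reader of the directory now sees a mix of old and new output, which is exactly the silent-corruption risk raised above.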
I remember that in the earlier version of that PR, I deleted files by calling
the HDFS API.
We discussed and concluded that it's a bit scary to have something directly
deleting users' files in Spark.
Best,
--
Nan Zhu
On Monday, June 2, 2014 at 10:39 PM, Patrick Wendell wrote:
(A) Semantics
On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com wrote:
(B) Semantics in Spark 1.0 and earlier:
Do you mean 1.0 and later?
Option (B) with the exception-on-clobber sounds fine to me, btw. My use
pattern is probably common but not universal, and deleting user files is
indeed
+1 on Option (B) with flag to allow semantics in (A) for back compatibility.
Kexin
On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com
wrote:
(B) Semantics in Spark 1.0 and earlier:
Do
Good catch! Yes I meant 1.0 and later.
On Mon, Jun 2, 2014 at 8:33 PM, Kexin Xie kexin@bigcommerce.com wrote:
+1 on Option (B) with flag to allow semantics in (A) for back compatibility.
Kexin
On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
On Mon,