Never mind, I noticed that mode=ignore doesn't write anything if the target
path already exists, even if the existing files are only in different
partitions than the ones being written to.

So ignore mode can't be used to mitigate the FileAlreadyExistsException
problem of append mode.
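
For reference, here's a minimal sketch of what I observed (assuming a local
SparkSession and a throwaway path of my own, not the actual job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    target_path = '/tmp/ignore_mode_test'  # hypothetical path, for illustration

    # First write creates the base path with partition p=1.
    spark.createDataFrame([(1, 'a')], ['p', 'v']) \
        .write.mode('ignore').partitionBy('p').orc(target_path)

    # Second write targets a different partition (p=2), but because
    # target_path itself already exists, 'ignore' skips the write entirely.
    spark.createDataFrame([(2, 'b')], ['p', 'v']) \
        .write.mode('ignore').partitionBy('p').orc(target_path)

    # Only p=1 shows up; p=2 was never written.
    spark.read.orc(target_path).show()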

On Thu, May 9, 2019 at 4:28 PM Juho Autio <juho.au...@rovio.com> wrote:

> Does Spark handle 'ignore' mode at the file level or the partition level?
>
>
> My code is like this:
>
>     df.write \
>         .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
>         .mode('ignore') \
>         .partitionBy('p') \
>         .orc(target_path)
>
> When I use mode('append'), my job sometimes fails
> with FileAlreadyExistsException when a failed task is retried. So I would
> like to skip writing if a file with the same name already exists, and write
> the file if it doesn't, of course.
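>
> (As an aside, the committer algorithm can also be set session-wide with the
> spark.hadoop.* prefix instead of a per-write option; this is just a sketch of
> that alternative, not something I've verified for this job:)
>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder \
>         .config('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2') \
>         .getOrCreate()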
>
>
> Here's an example scenario that I'm not sure about:
>
> Let's say I have parallelism=2, and there's data in both splits, going to
> the same output partition. So the expected result after the job has run is
> that two files have been created:
>
> target_path/p=1/output-abc-1.orc
> target_path/p=1/output-abc-2.orc
>
> I know that this works in the normal case. But is there anything special
> to be aware of when there are failed task attempts?
>
> For example, consider that the first task tries to write, but the task
> fails for some reason, even though the file is successfully created at:
>
> target_path/p=1/output-abc-1.orc
>
> Then the task is retried, but it turns out that the file was already
> written, so writing it is skipped this time. All good so far.
>
> Then there's another task that should write the 2nd file (out of 2):
>
> target_path/p=1/output-abc-2.orc
>
> Is this file written, or would it be ignored because target_path/p=1/
> already exists?
>
>
> Extract from Spark docs:
>
> > Ignore mode means that when saving a DataFrame to a data source, if data
> already exists, the save operation is expected to not save the contents of
> the DataFrame and to not change the existing data
>
> So here the meaning of "data" is a bit ambiguous. I've read that for SQL
> table writes with SaveMode.Ignore, the write is skipped entirely if the
> table exists. Does it have a different meaning for a regular df.write?
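>
> For comparison, the table case I mean would look something like this (the
> table name is made up, just to illustrate what I've read about
> SaveMode.Ignore):
>
>     # If the table already exists, the whole save is skipped.
>     df.write.mode('ignore').saveAsTable('my_db.events')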
>
>
> Thanks!
>
