Re: Overwrite support from ParquetIO

Alexey Romanenko Thu, 28 Jan 2021 09:14:19 -0800

1. Personally, I’d recommend purging the output directory (if that’s needed, of 
course) before starting your pipeline, as part of your driver program and not 
in a DoFn, to avoid the potential side effects that Reuven mentioned before. 
Another option is to write the files into a new directory with a unique name 
and then, after your pipeline has finished, atomically rename it. Though, of 
course, the final solution depends on the internals of your application and 
environment.


Imho, FS manipulations (like this) should be part of the driver program and not 
of a distributed data processing pipeline, where they can be quite tricky to do 
reliably.
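
To make the two options concrete, here is a minimal sketch of what the driver 
could do around the pipeline run. It assumes a local (or otherwise 
java.nio-accessible) filesystem where a directory rename is atomic; on an 
object store such as S3 there is no atomic rename, so you would use the store's 
own client instead. The class and method names (OutputDirPrep, purge, 
newStagingDir, publish) are hypothetical and not part of Beam.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical driver-side helpers; none of this is a Beam API.
class OutputDirPrep {

    // Deletes dir and everything below it; does nothing if dir is absent.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> all;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so children come before their parent directories.
            all = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : all) {
            Files.delete(p);
        }
    }

    // Option 1: empty the output directory before the pipeline starts.
    static void purge(Path outputDir) throws IOException {
        deleteRecursively(outputDir);
        Files.createDirectories(outputDir);
    }

    // Option 2, step 1: create a uniquely named sibling directory to write into.
    // A sibling keeps both directories on the same filesystem.
    static Path newStagingDir(Path finalDir) throws IOException {
        String name = finalDir.getFileName() + "-" + UUID.randomUUID();
        return Files.createDirectory(finalDir.resolveSibling(name));
    }

    // Option 2, step 2: after the pipeline has finished, swap the staging
    // directory in. Only the rename itself is atomic; the delete before it is not.
    static void publish(Path stagingDir, Path finalDir) throws IOException {
        deleteRecursively(finalDir);
        Files.move(stagingDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("overwrite-demo");
        Path finalDir = base.resolve("output");
        Files.createDirectories(finalDir);
        Files.write(finalDir.resolve("old.parquet"), new byte[] {0});

        Path staging = newStagingDir(finalDir);
        // Placeholder for pipeline.run().waitUntilFinish() writing into staging.
        Files.write(staging.resolve("new.parquet"), new byte[] {1});
        publish(staging, finalDir);

        try (Stream<Path> files = Files.list(finalDir)) {
            files.forEach(p -> System.out.println(p.getFileName()));
        }
    }
}
```

With option 2, readers of the final directory never see a half-written run, at 
the cost of a short window between the delete and the rename in which the 
directory does not exist.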
 
2. Yes, for sure we can’t rely on the old files being overwritten by the new 
files. Moreover, we need to make sure that they won’t be overwritten, to 
guarantee that we won’t lose them unexpectedly.
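
As a small illustration of why name collisions can’t be relied on, the sketch 
below compares the file names of two runs with different shard counts, which is 
the case Tao describes. The name pattern used here (prefix-SSSSS-of-NNNNN) is 
only an assumption for the example; the real names depend on the naming 
strategy you configure. StaleShardDemo and its methods are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration; the naming pattern is an assumption, not Beam's API.
class StaleShardDemo {

    // Builds names such as output-00000-of-00005.parquet for each shard.
    static Set<String> shardNames(String prefix, int numShards) {
        Set<String> names = new LinkedHashSet<>();
        for (int i = 0; i < numShards; i++) {
            names.add(String.format("%s-%05d-of-%05d.parquet", prefix, i, numShards));
        }
        return names;
    }

    // Files from the previous run whose names the current run does not reuse.
    static Set<String> staleAfterRerun(Set<String> previous, Set<String> current) {
        Set<String> stale = new LinkedHashSet<>(previous);
        stale.removeAll(current);
        return stale;
    }

    public static void main(String[] args) {
        Set<String> previousRun = shardNames("output", 5);
        Set<String> currentRun = shardNames("output", 3);
        // Every name printed here would be left behind in the directory.
        staleAfterRerun(previousRun, currentRun).forEach(System.out::println);
    }
}
```

Because the shard count is part of each name in this pattern, none of the five 
old files collides with the three new ones, so all of them would stay in the 
directory next to the new output.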

> On 27 Jan 2021, at 21:06, Tao Li <[email protected]> wrote:
> 
> @Alexey Romanenko thanks for your response. 
> Regarding your questions:
>  
> 1. Yes, I can purge this directory (e.g. using the S3 client from the AWS 
> SDK) before using ParquetIO to save files. The caveat is that this deletion 
> is not part of the Beam pipeline, so it will kick off before the pipeline 
> starts. Ideally, this purge operation could be baked into the write 
> operation with ParquetIO, so that the deletion happens right before the 
> files are written.
> 2. Regarding the naming strategy: yes, the old files will be overwritten by 
> the new files if they have the same file names. However, this does not 
> guarantee that all the old files in this directory are wiped out (which is 
> actually my requirement). For example, we may change the shard count (through 
> the withNumShards() method) between pipeline runs, and there could be old 
> files from a previous run that won’t get overwritten in the current run. 
>  
> Please let me know if this makes sense to you. Thanks!
>  
>  
> From: Alexey Romanenko <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, January 27, 2021 at 9:10 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Overwrite support from ParquetIO
>  
> What do you mean by “wipe out all existing parquet files before a write 
> operation”? Are these all files that already exist in the same output 
> directory? Can you purge this directory before or just use a new output 
> directory for every pipeline run?
>  
> To write Parquet files you need to use ParquetIO.sink() with FileIO.write(), 
> and I don’t think it will clean up the output directory before the write. 
> Though, if there are name collisions between existing and new output files 
> (this depends on the naming strategy used), then I think the old files will 
> be overwritten by the new ones. 
>  
>  
> 
> 
>> On 25 Jan 2021, at 19:10, Tao Li <[email protected] <mailto:[email protected]>> 
>> wrote:
>>  
>> Hi Beam community,
>>  
>> Does ParquetIO support an overwrite behavior when saving files? More 
>> specifically, I would like to wipe out all existing parquet files before a 
>> write operation. Is there a ParquetIO API to support that? Thanks!
