I update parquet files as follows:

A. First save your data in row groups.
B. For each row group that contains DELETED records, remove those records, 
delete the original row group from the parquet file, and append the modified 
row group to the file.
C. Add any new INSERTS as a new row group appended to the file.
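
To make steps A-C concrete, here is a minimal in-memory sketch. It assumes a 
file can be modelled as a list of row groups and each row group as a list of 
dict records. The name apply_changes and the "id" key column are hypothetical, 
and nothing here calls a real parquet library.

```python
def apply_changes(row_groups, deleted_keys, inserts, key="id"):
    """Return the new list of row groups after applying deletes and inserts.

    row_groups   -- list of row groups, each a list of dict records (step A)
    deleted_keys -- set of key values whose records must be removed
    inserts      -- list of new records to add
    """
    untouched = []   # row groups with no deleted records stay where they are
    rewritten = []   # modified row groups are appended after the untouched ones
    for group in row_groups:
        kept = [rec for rec in group if rec[key] not in deleted_keys]
        if len(kept) == len(group):
            untouched.append(group)
        elif kept:
            # Step B: drop the old row group, append the modified copy.
            rewritten.append(kept)
        # A row group whose records were all deleted simply disappears.
    result = untouched + rewritten
    if inserts:
        # Step C: new records go into one new row group at the end.
        result.append(list(inserts))
    return result


groups = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"id": 4}]]
updated = apply_changes(groups, deleted_keys={2}, inserts=[{"id": 5}])
```

Only the row group that held a deleted record is rewritten; the other one is 
carried over unchanged.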

Alternatively, you could create a BOOLEAN column and flag each deleted record 
as TRUE. It's a lot less work to modify a single column of data in a parquet 
row group than to remove a row from every column in that row group.
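
A rough sketch of the flag approach, assuming a row group is modelled as a 
dict that maps column names to lists of values. The column names "id" and 
"deleted" and both function names are made up for illustration.

```python
def flag_deleted(row_group, deleted_keys, key="id", flag="deleted"):
    """Return a copy of the row group with only the flag column rewritten."""
    new_flags = [
        already_deleted or (k in deleted_keys)
        for k, already_deleted in zip(row_group[key], row_group[flag])
    ]
    # Every other column is reused as-is; only the flag column is replaced.
    return {**row_group, flag: new_flags}


def live_rows(row_group, flag="deleted"):
    """Return the records a reader should see, skipping flagged rows."""
    columns = [name for name in row_group if name != flag]
    return [
        {name: row_group[name][i] for name in columns}
        for i, is_deleted in enumerate(row_group[flag])
        if not is_deleted
    ]


row_group = {
    "id": [1, 2, 3],
    "value": ["a", "b", "c"],
    "deleted": [False, False, False],
}
flagged = flag_deleted(row_group, deleted_keys={2})
visible = live_rows(flagged)
```

The trade-off is that every reader has to filter on the flag, and the flagged 
rows keep taking up space until the row group is eventually rewritten.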

-----Original Message-----
From: Nicolas PARIS <[email protected]> 
Sent: Thursday, February 27, 2020 1:04 PM
To: [email protected]
Subject: Re: Patterns for data updating?


> However, updating parquet files can be a bit troublesome.  The files 
> cannot easily be appended to.  So some process has to periodically 
> re-write the parquet files.  Also, we don't want to have hundreds or 
> thousands of separate files, as this can slow down query executing.
> So we don't want to end up with a new file every 10 seconds.  What I 
> have been thinking is to have a process that runs which writes changes 
> fairly frequently to small new files and another process that rolls up 
> those small files into progressively larger ones as they get older.
> When querying the data I will have to de-duplicate and keep only the 
> most recent version of each record, which I think is possible using 
> window functions.  Thus the file aggregation process might not have to 
> worry about having the exact same row in two files temporarily.  I'm 
> wondering if anyone has gone down this road before and has insights to 
> share about it.
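
The de-duplication step described above could look roughly like this. It is 
only a sketch in plain Python rather than SQL: it assumes every record carries 
a key column and a version column (named "id" and "updated_at" here, both 
hypothetical) and keeps the highest version per key, which is what a 
ROW_NUMBER() window partitioned by key and ordered by version descending would 
select.

```python
def latest_versions(records, key="id", version="updated_at"):
    """Keep only the most recent version of each record.

    records may come from several files and may contain the same row twice;
    when versions tie, the record seen last wins.
    """
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[version] >= current[version]:
            latest[record[key]] = record
    return list(latest.values())


small_file = [
    {"id": 1, "updated_at": 1, "v": "old"},
    {"id": 2, "updated_at": 1, "v": "x"},
]
rolled_up_file = [
    {"id": 1, "updated_at": 2, "v": "new"},
    {"id": 2, "updated_at": 1, "v": "x"},  # same row present in two files
]
deduplicated = latest_versions(small_file + rolled_up_file)
```

Because exact duplicates collapse to one record, the roll-up process can leave 
the same row in two files for a while without affecting query results.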

You might be interested in delta-lake, which provides an implementation of the 
sql merge statement on top of parquet files. Implementing a drill connector for 
it should be feasible. This could be used together with the hybrid design 
described by Ted and Paul, and would make parquet more than a static archive.

https://docs.delta.io/latest/delta-intro.html

--
nicolas paris

