I update parquet files as follows:

A. First save your data in row groups.
B. Modify any row groups by removing DELETED records. Delete the row group from the parquet file and append the modified row group to the file.
C. Add any new INSERTS as a new row group appended to the file.
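
To make steps A to C concrete, here is a minimal sketch of the bookkeeping only. It models row groups as plain Python lists held in memory and does no real Parquet I/O; the function name apply_changes, the "id" key column, and the sample rows are all made up for illustration. With a real file you would read and write the row groups through whatever Parquet library you use.

```python
from typing import Dict, List, Set

Row = Dict[str, object]
RowGroup = List[Row]


def apply_changes(
    row_groups: List[RowGroup],
    deleted_ids: Set[int],
    inserts: List[Row],
    key: str = "id",  # hypothetical primary-key column
) -> List[RowGroup]:
    """Return a new list of row groups with deletes and inserts applied."""
    untouched: List[RowGroup] = []
    rewritten: List[RowGroup] = []

    for group in row_groups:
        if any(row[key] in deleted_ids for row in group):
            # Step B: drop the old row group and re-append it without
            # the deleted records (skip it entirely if nothing survives).
            kept = [row for row in group if row[key] not in deleted_ids]
            if kept:
                rewritten.append(kept)
        else:
            # Row groups with no deleted records are carried over as-is.
            untouched.append(group)

    result = untouched + rewritten
    if inserts:
        # Step C: new records go into one new row group at the end.
        result.append(list(inserts))
    return result


# Step A: data already stored as two row groups.
groups = [
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"id": 3, "v": "c"}, {"id": 4, "v": "d"}],
]
updated = apply_changes(groups, deleted_ids={2}, inserts=[{"id": 5, "v": "e"}])
```

Only the row group that held record 2 is rewritten; the other one is carried over untouched, which is the point of organizing the data by row group in the first place.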
Alternatively, you could create a BOOLEAN column and flag each deleted record as TRUE. It's a lot less work to modify a single column of data in a parquet row group than to remove a row from every column store.

-----Original Message-----
From: Nicolas PARIS <[email protected]>
Sent: Thursday, February 27, 2020 1:04 PM
To: [email protected]
Subject: Re: Patterns for data updating?

> However, updating parquet files can be a bit troublesome. The files
> cannot easily be appended to. So some process has to periodically
> re-write the parquet files. Also, we don't want to have hundreds or
> thousands of separate files, as this can slow down query executing.
> So we don't want to end up with a new file every 10 seconds. What I
> have been thinking is to have a process that runs which writes changes
> fairly frequently to small new files and another process that rolls up
> those small files into progressively larger ones as they get older.
> When querying the data I will have to de-duplicate and keep only the
> most recent version of each record, which I think is possible using
> window functions. Thus the file aggregation process might not have to
> worry about having the exact same row in two files temporarily. I'm
> wondering if anyone has gone down this road before and has insights to
> share about it.

You might be interested in delta-lake, which provides an implementation of the SQL MERGE statement on top of parquet files. Implementing a Drill connector on this should be feasible. This could be used together with the hybrid design described by Ted and Paul, and makes parquet more than a static archive.
https://docs.delta.io/latest/delta-intro.html

--
nicolas paris
