Thanks for your answer, Ryan. In the short term we'll only have INSERT INTO/OVERWRITE for Icebeg tables in Impala (so the way to go is AppendFiles and ReplacePartitions accordingly). But I agree that a MERGE INTO statement would be super useful. Hopefully we'll add support for it as well in the not too distant future.
Thanks again, Zoltan On Fri, Jan 29, 2021 at 8:19 PM Ryan Blue <rb...@netflix.com.invalid> wrote: > Zoltan, > > The warning is that dynamic overwrites in general aren't recommended. > ReplacePartitions is the right operation to use for dynamic overwrite, we > just want to steer users away from dynamic overwrites in general. > > The problem with dynamic overwrite is that its behavior depends on the > underlying physical structure of the table. If your table is partitioned by > days, then it will overwrite by day. If it is partitioned by hours, it will > overwrite by hour. That means that the behavior changes when table > partitioning changes. That's a bad thing because we want SQL operations to > be logical operations that do not depend on the table layout. > > OverwriteFiles is a better option because what gets overwritten is > explicit. If you want to overwrite a day, you pass a filter for that day. > Another way around this problem is to support MERGE INTO, which will detect > the files that need to be changed and correctly rewrite them, wherever they > are in the table. > > rb > > On Fri, Jan 29, 2021 at 10:14 AM Zoltán Borók-Nagy > <borokna...@cloudera.com.invalid> wrote: > >> Hey everyone, >> >> I'm currently working on the INSERT OVERWRITE statement for Iceberg >> tables in Impala. >> >> Seems like ReplacePartitions is the perfect interface for this job: >> https://github.infra.cloudera.com/CDH/iceberg/blob/cdpd-master/api/src/main/java/org/apache/iceberg/ReplacePartitions.java >> >> IIUC Spark also uses this interface for dynamic overwrites. Though the >> class comment says that this interface is not recommended to use, and use >> OverwriteFiles instead. OverwriteFiles is more generic and more explicit, >> but for this task it would need extra boilerplate code, therefore more >> possibilities to do stg wrong. >> >> So my question is, is there any problem with using ReplacePartitions for >> dynamic overwrites? I see that it only replaces partitions with the current >> partition spec, but that's probably fine. Otherwise handling partition >> layout evolution and dynamic inserts can be complicated and error-prone >> anyway. >> >> Apart from which interface to use, is there anything I should be aware >> of? E.g. I guess we don't want to allow dynamic overwrites if the table is >> partitioned by the BUCKET transform. >> >> Thanks, >> Zoltan >> >> > > -- > Ryan Blue > Software Engineer > Netflix >