Hi, even a simple "UPDATE" SQL statement is not supported yet in Iceberg.
spark.sql(""" UPDATE db1.iceberg_table2 t SET t.data = 'b1' WHERE t.id = 2 """).show() => error java.lang.UnsupportedOperationException: UPDATE TABLE is not supported temporarily. at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:794) It is not clear from Iceberg API or Doc site how to perform simple update / delete .. and how to convert from given sql to API if possible. I do see there is an api call "newRewrite()" but it looks very low level, requiring me to scan files, then rewrite new files content and remove previous files if modified. ... How to efficiently implement the "if-modified" condition when the where clause is a complex combination of "and-or-not-operators" ? For example, to rely on predicate-push-down wherever possible, and to avoid comparing previous/new file content for any effective change ? Example: UPDATE db1.iceberg_table2 t SET t.data = 'b1' WHERE ( t.id = 2) ... <= might use predicate-push-down from parquet, to blindly skip whole files, or blindly copy whole RowGroup UPDATE db1.iceberg_table2 t SET t.data = 'b1' WHERE ( t.id like '%abc%') .... <= no applyable predicate-push-down from parquet... but at runtime, some files might not be modified after scanning the "id" column content Is there at least some helper class to do simple file Parquet transformations, with copy by SQL "select" only, or spark DataSet.map() ? from previous example, the new file might be generated by this sql sql= " SELECT id, IF( t.id like '%abc%, 'b1', t.data) FROM tempTableForFileToUpdate " Is this (below code) the kind of API call that we are supposed to do to perform updates ?? FileScan ... map( (parquetFile) -> { if (noPredicatePushDownForFile( parquetFile, "id like %abc%" )) { val df = sqlContext.read.parquet( parquetFile ) df.registerTempTable("tempTableForFileToUpdate") spark.sql(sq).write().save( newTempFileToAdd ) if (compareFileEffectivelyDifferent( parquetFile , newTempFileToAdd )) { ... add to transaction newUpdate: fileToRemove=parquetFile, fileToAdd= newTempFileToAdd } else { ... delete temporarily created newFileToAdd } } } It looks scary and error prone to perform simple UPDATE like this, doesn't it ? I hope there is a better way, and I did not find it in the documentation Regards, Arnaud Le jeu. 15 oct. 2020 à 16:26, Ashish Mehta <mehta.ashis...@gmail.com> a écrit : > Hi, > > I am also trying to achieve something similar. The Merge INTO format has > multiple "when matched/not matched" condition, and usually, you can take > action like "delete" or "update" or "insert", can I do that by "overwrite > part of the destination table with the replacement"? Also, the > recommendation of overwriting, will it use UPSERT, or are you trying to > overwrite everything on the target table? > > Till now I have been able to use Iceberg API directly for UPSERT, I > believe there is no way I can do this via dataFrame operations, as a single > commit. > > Thanks, > Ashish > > > On Tue, Oct 13, 2020 at 12:24 PM Ryan Blue <rb...@netflix.com.invalid> > wrote: > >> Hi Arnaud, >> >> You're right that MERGE INTO isn't supported yet. What I've seen most >> people do is to implement the operation using SQL to select existing data >> and join it with new data, then overwrite part of the destination table >> with the replacement. 
>>
>> On Mon, Oct 12, 2020 at 2:10 PM Arnaud Nauwynck <
>> arnaud.nauwy...@gmail.com> wrote:
>>
>>> Hi Iceberg dev team,
>>>
>>> I am trying to use Iceberg to do an "upsert" based on an event table.
>>> In pure SQL, it is not supported yet.
>>>
>>> Here is my example with 2 tables: "iceberg_table" should contain the
>>> updated data, and "table_event" contains the event updates.
>>>
>>> scala> spark.sql(""" MERGE INTO db1.iceberg_table t USING
>>> db1.table_event e ON e.id = t.id WHEN MATCHED THEN UPDATE SET t.data
>>> = e.data WHEN NOT MATCHED THEN INSERT (id, data) VALUES (id, data)
>>> """).show()
>>> ...
>>> java.lang.UnsupportedOperationException: MERGE INTO TABLE is not
>>> supported temporarily.
>>>
>>> Is there a way to execute it programmatically using the Spark API or
>>> the Iceberg API?
>>> Any idea when the SQL feature might be available?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Arnaud Nauwynck
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
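
P.S.: if I apply the "select existing data, then overwrite" suggestion
quoted above to my simple UPDATE example, I suppose it would become
something like this (untested sketch; it rewrites the whole table unless
partitioning limits what INSERT OVERWRITE replaces):

  spark.sql("""
    INSERT OVERWRITE db1.iceberg_table2
    SELECT id,
           CASE WHEN id = 2 THEN 'b1' ELSE data END AS data
    FROM db1.iceberg_table2
  """)

But that still rewrites unmodified files, which is exactly why I was
asking about predicate push-down above.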