Hi Tyler,

I think UPSERT is a good alternative: it is as concise as INSERT and has
the right semantics. It's just that users seem to rarely use UPSERT either
(perhaps because there's no UPDATE in batch big data processing).
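
For reference, the Phoenix-style syntax tracked in CALCITE-492 looks
roughly like this (a sketch, using the `result` table from the original
mail and placeholder values):

UPSERT INTO result (rowkey, col1) VALUES ('k1', 42);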

By *"INSERT will behave differently in batch & stream processing"* I mean
that if we use the "INSERT" solution I described above, there will be ten
INSERTs:

*INSERT INTO result(rowkey, col1) values(...)*

*INSERT INTO result(rowkey, col2) values(...)*

*...*

*INSERT INTO result(rowkey, col10) values(...)*

Although we issue ten INSERTs, there will be only ONE new record in the
target table, because 9 of the INSERTs actually UPDATE the existing
record. So in stream computing *INSERT = (INSERT or UPDATE)*, while in
batch, *INSERT is just INSERT*.
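
In other words, under the streaming interpretation each of those INSERTs
is effectively executed as a MERGE, roughly like this (a sketch for col1
with placeholder values):

MERGE INTO result D
   USING (VALUES ('k1', 42)) S(rowkey, col1)
   ON (D.rowkey = S.rowkey)
   WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
   WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1)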

I think the essence of the problem is that there is no UPDATE in batch,
but UPDATE is required in streaming.



Tyler Akidau <taki...@google.com> wrote on Thu, Jun 22, 2017 at 11:35 PM:

> Calcite appears to have UPSERT
> <https://issues.apache.org/jira/browse/CALCITE-492> support; can we just
> use that instead?
>
> Also, I don't understand your statement that "INSERT will behave
> differently in batch & stream processing". Can you explain further?
>
>
> -Tyler
>
>
> On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
> wrote:
>
>> If I'm understanding correctly, Hive does that with an INSERT INTO
>> followed by a SELECT statement that does the aggregation.
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
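>>
>> A minimal sketch of that pattern (hypothetical `result_col1`, `input`,
>> and `val` names; note that Hive's INSERT INTO appends to the target):
>>
>> INSERT INTO TABLE result_col1
>> SELECT rowkey, sum(val) FROM input GROUP BY rowkey;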
>>
>> On Thu, Jun 22, 2017 at 1:32 AM James <xumingmi...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I am thinking about a problem related to SQL and stream computing, and
>>> I want to hear your opinions.
>>>
>>> In stream computing, there is a typical case like this:
>>>
>>> *We want to calculate a big wide result table, which has one rowkey and
>>> ten
>>> value columns:*
>>> *create table result (*
>>> *    rowkey varchar(127) PRIMARY KEY,*
>>> *    col1 int,*
>>> *    col2 int,*
>>> *    ...*
>>> *    col10 int*
>>> *);*
>>>
>>> Each of the value columns is calculated by a complex query, so there will
>>> be ten SQL statements to populate this table. Each statement must (as
>>> sketched below):
>>>
>>> * First check whether there is a row for the specified `rowkey`.
>>> * If yes, then `update`; otherwise, `insert`.
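>>>
>>> A sketch of that two-step logic for col1 (placeholder values; in
>>> practice the two statements would need to run atomically):
>>>
>>> update result set col1 = 42 where rowkey = 'k1';
>>> -- if the UPDATE matched no row:
>>> insert into result (rowkey, col1) values ('k1', 42);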
>>>
>>> There is actually a dedicated SQL syntax called `MERGE` designed for
>>> this (introduced in SQL:2003); a sample usage is:
>>>
>>> MERGE INTO result D
>>>    USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>>    ON (D.rowkey = S.rowkey)
>>>    WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>>    WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1)
>>>
>>>
>>> *The semantics fit perfectly, but it is very verbose, and normal users
>>> rarely use this syntax.*
>>>
>>> So my colleagues invented a new syntax for this scenario (or, more
>>> precisely, a new interpretation of the INSERT statement). For the above
>>> scenario, users will always write `insert` statements:
>>>
>>> insert into result(rowkey, col1) values(...)
>>> insert into result(rowkey, col2) values(...)
>>>
>>> The SQL interpreter will do a trick behind the scenes: if the `rowkey`
>>> exists, then `update`; otherwise, `insert`. This solution is very
>>> concise, but it violates the semantics of `insert`: with this solution,
>>> INSERT will behave differently in batch & stream processing.
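>>>
>>> To make the difference concrete (placeholder values):
>>>
>>> insert into result(rowkey, col1) values('k1', 1);
>>> insert into result(rowkey, col2) values('k1', 2);
>>> -- streaming interpretation: a single row where col1 = 1 and col2 = 2
>>> -- batch interpretation: a duplicate 'k1' row or a primary-key
>>> -- violation, depending on the engine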
>>>
>>> What do you think? Which do you prefer? What's your reasoning?
>>>
>>> Looking forward to your opinions, thanks in advance.
>>>
>> --
>> Thanks,
>>
>> Jesse
>>
>
