Would like to share my thoughts in another perspective. IMO this is a
typical scenario for column based databases, like Hbase/Cassandra. You may
need to choose a right database if possible.

UPSERT is another alternative option, but I wouldn't suggest to a
customized check-insert/check-update implementation. The actual job should
be done in database side.

On Thu, Jun 22, 2017 at 6:59 PM, James <xumingmi...@gmail.com> wrote:

> Hi Tyler,
>
> I think upsert is a good alternative, concise as INSERT and have the valid
> semantics. Just that user seems rarely use UPSERT either(might because
> there's no UPDATE in batch big data processing).
>
> By *"INSERT will behave differently in batch & stream processing"* I
> mean, if we use the "INSERT" solution I described above, there will be ten
> INSERTs:
>
> *INSERT INTO result(rowkey, col1) values(...)*
>
> *INSERT INTO result(rowkey, col2) values(...)*
>
> *...INSERT INTO result(rowkey, col10) values(...)*
>
> Although we issued ten INSERTs, but there will be only ONE new records in
> the target table, because 9 of the INSERTs are actually UPDATing the
> record, so in stream computing *INSERT = (INSERT or UPDATE)*, while in
> batch,* INSERT is just INSERT*.
>
> I think the essence of this problem is, there is no UPDATE in batch, but
> require UPDATE in streaming.
>
>
>
> Tyler Akidau <taki...@google.com>于2017年6月22日周四 下午11:35写道:
>
>> Calcite appears to have UPSERT
>> <https://issues.apache.org/jira/browse/CALCITE-492> support, can we just
>> use that instead?
>>
>> Also, I don't understand your statement that "INSERT will behave
>> differently in batch & stream processing". Can you explain further?
>>
>>
>> -Tyler
>>
>>
>> On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
>> wrote:
>>
>>> If I'm understanding correctly, Hive does that with a insert into
>>> followed by a select statement that does the aggregation. https://cwiki.
>>> apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-
>>> InsertingdataintoHiveTablesfromqueries
>>>
>>> On Thu, Jun 22, 2017 at 1:32 AM James <xumingmi...@gmail.com> wrote:
>>>
>>>> Hi team,
>>>>
>>>> I am thinking about a SQL and stream computing related problem, want to
>>>> hear your opinions.
>>>>
>>>> In stream computing, there is a typical case like this:
>>>>
>>>> *We want to calculate a big wide result table, which has one rowkey and
>>>> ten
>>>> value columns:*
>>>> *create table result (*
>>>> *    rowkey varchar(127) PRIMARY KEY,*
>>>> *    col1 int,*
>>>> *    col2 int,*
>>>> *    ...*
>>>> *    col10 int*
>>>> *);*
>>>>
>>>> Each of the value columns is calculated by a complex query, so there
>>>> will
>>>> be ten SQLs to calculate
>>>> data for this table, for each sql:
>>>>
>>>> * First check whether there is a row for the specified `rowkey`.
>>>> * If yes, then `update`, otherwise `insert`.
>>>>
>>>> There is actually a dedicated sql syntax called `MERGE` designed for
>>>> this(SQL2008), a sample usage is:
>>>>
>>>> MERGE INTO result D
>>>>    USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>>>    ON (D.rowkey = S.rowkey)
>>>>    WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>>>    WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>>>
>>>>
>>>> *The semantic fits perfectly, but it is very verbose, and normal users
>>>> rarely used this syntax.*
>>>>
>>>> So my colleagues invented a new syntax for this scenario (Or more
>>>> precisely, a new interpretation for the INSERT statement). For the above
>>>> scenario, user will always write `insert` statement:
>>>>
>>>> insert into result(rowkey, col1) values(...)
>>>> insert into result(rowkey, col2) values(...)
>>>>
>>>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>>>> exists, then update, otherwise `insert`. This solution is very concise,
>>>> but
>>>> violates the semantics of `insert`, using this solution INSERT will
>>>> behave
>>>> differently in batch & stream processing.
>>>>
>>>> How do you guys think? which do you prefer? What's your reasoning?
>>>>
>>>> Looking forward to your opinions, thanks in advance.
>>>>
>>> --
>>> Thanks,
>>>
>>> Jesse
>>>
>>


-- 
----
Mingmin

Reply via email to