Hi Tyler, I think UPSERT is a good alternative: it is as concise as INSERT and has the right semantics. The only concern is that users seem to rarely use UPSERT either (perhaps because there is no UPDATE in batch big data processing).
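As a side note, the upsert behavior being discussed can be sketched with SQLite's `INSERT ... ON CONFLICT` clause, one common UPSERT dialect (requires SQLite 3.24+; the table and column names are illustrative, not from any actual system in this thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (rowkey TEXT PRIMARY KEY, col1 INT)")

def upsert(rowkey, col1):
    # One statement covers both the insert case and the update case.
    conn.execute(
        "INSERT INTO result (rowkey, col1) VALUES (?, ?) "
        "ON CONFLICT(rowkey) DO UPDATE SET col1 = excluded.col1",
        (rowkey, col1),
    )

upsert("k1", 1)  # row does not exist yet -> behaves as INSERT
upsert("k1", 2)  # row exists -> behaves as UPDATE
rows = conn.execute("SELECT rowkey, col1 FROM result").fetchall()
print(rows)  # [('k1', 2)] -- still a single row
```

The point is that a single statement covers both branches of the check-then-insert-or-update logic, with no separate existence check.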
By *"INSERT will behave differently in batch & stream processing"* I mean: if we use the "INSERT" solution I described above, there will be ten INSERTs:

*INSERT INTO result(rowkey, col1) values(...)*
*INSERT INTO result(rowkey, col2) values(...)*
*...*
*INSERT INTO result(rowkey, col10) values(...)*

Although we issue ten INSERTs, there will be only ONE new record in the target table, because nine of the INSERTs are actually UPDATing the record. So in stream computing *INSERT = (INSERT or UPDATE)*, while in batch, *INSERT is just INSERT*.

I think the essence of this problem is: there is no UPDATE in batch, but streaming requires UPDATE.

Tyler Akidau <taki...@google.com> wrote on Thu, Jun 22, 2017 at 11:35 PM:

> Calcite appears to have UPSERT
> <https://issues.apache.org/jira/browse/CALCITE-492> support, can we just
> use that instead?
>
> Also, I don't understand your statement that "INSERT will behave
> differently in batch & stream processing". Can you explain further?
>
> -Tyler
>
> On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
> wrote:
>
>> If I'm understanding correctly, Hive does that with an insert into
>> followed by a select statement that does the aggregation.
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>>
>> On Thu, Jun 22, 2017 at 1:32 AM James <xumingmi...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I am thinking about a SQL and stream computing related problem, and want
>>> to hear your opinions.
>>>
>>> In stream computing, there is a typical case like this:
>>>
>>> We want to calculate a big, wide result table, which has one rowkey and
>>> ten value columns:
>>>
>>> create table result (
>>>   rowkey varchar(127) PRIMARY KEY,
>>>   col1 int,
>>>   col2 int,
>>>   ...
>>>   col10 int
>>> );
>>>
>>> Each of the value columns is calculated by a complex query, so there will
>>> be ten SQL statements to calculate data for this table. Each statement
>>> must:
>>>
>>> * First check whether there is a row for the specified `rowkey`.
>>> * If yes, then `update`; otherwise `insert`.
>>>
>>> There is actually a dedicated SQL syntax called `MERGE` designed for
>>> this (SQL:2008); a sample usage is:
>>>
>>> MERGE INTO result D
>>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>> ON (D.rowkey = S.rowkey)
>>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1)
>>>
>>> The semantics fit perfectly, but it is very verbose, and normal users
>>> rarely use this syntax.
>>>
>>> So my colleagues invented a new syntax for this scenario (or, more
>>> precisely, a new interpretation of the INSERT statement). For the above
>>> scenario, the user will always write an `insert` statement:
>>>
>>> insert into result(rowkey, col1) values(...)
>>> insert into result(rowkey, col2) values(...)
>>>
>>> The SQL interpreter does a trick behind the scenes: if the `rowkey`
>>> exists, it updates; otherwise it inserts. This solution is very concise,
>>> but it violates the semantics of `insert`: using this solution, INSERT
>>> will behave differently in batch & stream processing.
>>>
>>> What do you guys think? Which do you prefer? What's your reasoning?
>>>
>>> Looking forward to your opinions, thanks in advance.
>>>
>> --
>> Thanks,
>>
>> Jesse
>>
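To make the ten-INSERT scenario concrete, here is a minimal simulation using Python's sqlite3 module, with SQLite's `ON CONFLICT` upsert standing in for the proposed insert-or-update interpretation of INSERT (purely illustrative; requires SQLite 3.24+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cols = ", ".join(f"col{i} INT" for i in range(1, 11))
conn.execute(f"CREATE TABLE result (rowkey TEXT PRIMARY KEY, {cols})")

# Ten independent statements, one per value column, all for the same rowkey.
# Under insert-or-update semantics, the first one INSERTs a new row and the
# other nine effectively UPDATE that existing row.
for i in range(1, 11):
    conn.execute(
        f"INSERT INTO result (rowkey, col{i}) VALUES (?, ?) "
        f"ON CONFLICT(rowkey) DO UPDATE SET col{i} = excluded.col{i}",
        ("k1", i),
    )

rows = conn.execute("SELECT * FROM result").fetchall()
print(len(rows))  # 1 -- ten statements, but only one row in the table
```

Under plain batch INSERT semantics the same ten statements would instead produce ten rows (or fail on the primary-key constraint after the first), which is exactly the batch-vs-stream divergence described in the thread.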