Calcite appears to have UPSERT <https://issues.apache.org/jira/browse/CALCITE-492> support, can we just use that instead?
Also, I don't understand your statement that "INSERT will behave differently in batch & stream processing". Can you explain further? -Tyler On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io> wrote: > If I'm understanding correctly, Hive does that with a insert into followed > by a select statement that does the aggregation. > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries > > On Thu, Jun 22, 2017 at 1:32 AM James <xumingmi...@gmail.com> wrote: > >> Hi team, >> >> I am thinking about a SQL and stream computing related problem, want to >> hear your opinions. >> >> In stream computing, there is a typical case like this: >> >> *We want to calculate a big wide result table, which has one rowkey and >> ten >> value columns:* >> *create table result (* >> * rowkey varchar(127) PRIMARY KEY,* >> * col1 int,* >> * col2 int,* >> * ...* >> * col10 int* >> *);* >> >> Each of the value columns is calculated by a complex query, so there will >> be ten SQLs to calculate >> data for this table, for each sql: >> >> * First check whether there is a row for the specified `rowkey`. >> * If yes, then `update`, otherwise `insert`. >> >> There is actually a dedicated sql syntax called `MERGE` designed for >> this(SQL2008), a sample usage is: >> >> MERGE INTO result D >> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S >> ON (D.rowkey = S.rowkey) >> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1 >> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) >> >> >> *The semantic fits perfectly, but it is very verbose, and normal users >> rarely used this syntax.* >> >> So my colleagues invented a new syntax for this scenario (Or more >> precisely, a new interpretation for the INSERT statement). For the above >> scenario, user will always write `insert` statement: >> >> insert into result(rowkey, col1) values(...) >> insert into result(rowkey, col2) values(...) >> >> The sql interpreter will do a trick behind the scene: if the `rowkey` >> exists, then update, otherwise `insert`. This solution is very concise, >> but >> violates the semantics of `insert`, using this solution INSERT will behave >> differently in batch & stream processing. >> >> How do you guys think? which do you prefer? What's your reasoning? >> >> Looking forward to your opinions, thanks in advance. >> > -- > Thanks, > > Jesse >