I would like to share my thoughts from another perspective. IMO this is a typical scenario for column-based databases like HBase/Cassandra; you may need to choose the right database if possible.
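(A write in a column store is inherently insert-or-update: for example, an HBase `Put` creates the row if it is absent and merges new columns into it otherwise, with no separate INSERT/UPDATE distinction. A toy Python model of that write path — the `put` helper below is illustrative only, not HBase's actual client API:)

```python
# Toy model of a column-oriented store's write path: one table is a
# mapping of rowkey -> {column: value}, and every write is an upsert.
table = {}  # rowkey -> {column: value}

def put(rowkey, column, value):
    # Creates the row on the first write for this rowkey; later writes
    # to the same rowkey merge in new columns instead of duplicating it.
    table.setdefault(rowkey, {})[column] = value

put("k1", "col1", 10)
put("k1", "col2", 20)  # same row: columns merged, no second record
print(table)  # -> {'k1': {'col1': 10, 'col2': 20}}
```

This is exactly the "ten writes, one record" behavior described further down in the thread, which is why such stores fit this workload without any MERGE/UPSERT machinery.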
UPSERT is another alternative option, but I wouldn't suggest a customized check-insert/check-update implementation. The actual work should be done on the database side.

On Thu, Jun 22, 2017 at 6:59 PM, James <xumingmi...@gmail.com> wrote:

> Hi Tyler,
>
> I think upsert is a good alternative: as concise as INSERT, and with the
> valid semantics. It is just that users seem to rarely use UPSERT either
> (perhaps because there is no UPDATE in batch big data processing).
>
> By *"INSERT will behave differently in batch & stream processing"* I
> mean: if we use the "INSERT" solution I described above, there will be
> ten INSERTs:
>
> *INSERT INTO result(rowkey, col1) values(...)*
> *INSERT INTO result(rowkey, col2) values(...)*
> *...*
> *INSERT INTO result(rowkey, col10) values(...)*
>
> Although we issued ten INSERTs, there will be only ONE new record in
> the target table, because 9 of the INSERTs are actually UPDATing the
> record. So in stream computing *INSERT = (INSERT or UPDATE)*, while in
> batch, *INSERT is just INSERT*.
>
> I think the essence of this problem is: there is no UPDATE in batch,
> but streaming requires UPDATE.
>
> Tyler Akidau <taki...@google.com> wrote on Thu, Jun 22, 2017 at 11:35 PM:
>
>> Calcite appears to have UPSERT support
>> <https://issues.apache.org/jira/browse/CALCITE-492>; can we just use
>> that instead?
>>
>> Also, I don't understand your statement that "INSERT will behave
>> differently in batch & stream processing". Can you explain further?
>>
>> -Tyler
>>
>> On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
>> wrote:
>>
>>> If I'm understanding correctly, Hive does that with an INSERT INTO
>>> followed by a SELECT statement that does the aggregation.
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>>>
>>> On Thu, Jun 22, 2017 at 1:32 AM James <xumingmi...@gmail.com> wrote:
>>>
>>>> Hi team,
>>>>
>>>> I am thinking about a SQL and stream computing related problem, and
>>>> want to hear your opinions.
>>>>
>>>> In stream computing, there is a typical case like this:
>>>>
>>>> *We want to calculate a big wide result table, which has one rowkey
>>>> and ten value columns:*
>>>>
>>>> *create table result (*
>>>> *  rowkey varchar(127) PRIMARY KEY,*
>>>> *  col1 int,*
>>>> *  col2 int,*
>>>> *  ...*
>>>> *  col10 int*
>>>> *);*
>>>>
>>>> Each of the value columns is calculated by a complex query, so there
>>>> will be ten SQLs to calculate data for this table. Each SQL must:
>>>>
>>>> * First check whether there is a row for the specified `rowkey`.
>>>> * If yes, then `update`; otherwise, `insert`.
>>>>
>>>> There is actually a dedicated SQL syntax called `MERGE` designed for
>>>> this (SQL:2008); a sample usage is:
>>>>
>>>> MERGE INTO result D
>>>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>>> ON (D.rowkey = S.rowkey)
>>>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>>>
>>>> *The semantics fit perfectly, but it is very verbose, and normal
>>>> users rarely use this syntax.*
>>>>
>>>> So my colleagues invented a new syntax for this scenario (or, more
>>>> precisely, a new interpretation of the INSERT statement). For the
>>>> above scenario, the user always writes an `insert` statement:
>>>>
>>>> insert into result(rowkey, col1) values(...)
>>>> insert into result(rowkey, col2) values(...)
>>>>
>>>> The SQL interpreter will do a trick behind the scenes: if the
>>>> `rowkey` exists, then update; otherwise, `insert`.
>>>> This solution is very concise, but it violates the semantics of
>>>> `insert`: with this solution, INSERT will behave differently in
>>>> batch & stream processing.
>>>>
>>>> What do you think? Which do you prefer, and what's your reasoning?
>>>>
>>>> Looking forward to your opinions. Thanks in advance.
>>>
>>> --
>>> Thanks,
>>>
>>> Jesse
>>
-- 
----
Mingmin
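[Editor's note: to make the insert-or-update semantics discussed in this thread concrete, here is a small sketch using SQLite's UPSERT clause (`INSERT ... ON CONFLICT`, available in SQLite >= 3.24) via Python's sqlite3 module. The table mirrors the `result` schema above, trimmed to two value columns; the `upsert` helper is illustrative, not part of any product discussed here.]

```python
import sqlite3

# In-memory database with a trimmed version of the wide `result` table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE result (
        rowkey TEXT PRIMARY KEY,
        col1   INTEGER,
        col2   INTEGER
    )
""")

def upsert(col, rowkey, value):
    # Inserts the row if `rowkey` is new, otherwise updates only the
    # targeted column. `col` is interpolated from trusted code here,
    # since column names cannot be bound as SQL parameters.
    conn.execute(
        f"INSERT INTO result(rowkey, {col}) VALUES (?, ?) "
        f"ON CONFLICT(rowkey) DO UPDATE SET {col} = excluded.{col}",
        (rowkey, value),
    )

upsert("col1", "k1", 10)   # first write: inserts the row
upsert("col2", "k1", 20)   # second write: updates the same row
print(conn.execute("SELECT rowkey, col1, col2 FROM result").fetchall())
# -> [('k1', 10, 20)]
```

Note that `INSERT OR REPLACE` would not work here: it replaces the entire row, nulling out columns written by the other queries, whereas `ON CONFLICT ... DO UPDATE` touches only the targeted column — which is the behavior the per-column streaming writes need.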