Hi all,

I've repeatedly seen the following pattern. Consider this SQL query (joining a stream to a table, from Calcite):

    SELECT STREAM o.rowtime, o.productId, o.orderId, o.units, p.name, p.unitPrice
    FROM Orders AS o
    JOIN Products AS p ON o.productId = p.productId;

A stream-to-table join is straightforward if the contents of the table are not changing (or are changing slowly). This query enriches a stream of orders with each product's list price.

The table usually lives in HBase, MySQL, or Redis. Most of our users think that we should use SideInputs to implement this, but there are some difficulties:

1. The table may be very large. AFAIK, SideInputs load all of the data into memory. We cannot load everything, but we can do some caching.
2. The table may be updated periodically, as mentioned in https://issues.apache.org/jira/browse/BEAM-1197
3. Sometimes users want to update the table, in scenarios that do not need high accuracy. (Reads and writes against the external storage cannot guarantee exactly-once.)

So we developed a component called DimState (maybe the name is not right):

- It either uses a cache (currently a LoadingCache) or loads the whole table. Both modes have a time-to-live mechanism.
- The abstract interface is called ExternalState. The implementations are HBaseState, JDBCState, and RedisState.
- State is accessed by key and namespace.
- It provides bulk access to the external table for performance.

Is there a better way to implement this? Can we add an abstraction for it to the Beam model? What do you think?

Best,
JingsongLee
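
P.S. To make the ExternalState idea more concrete, below is a minimal sketch of a lookup by key and namespace, with bulk access and a time-to-live cache in front of the external table. It is only an illustration under assumptions: the method name getAll, the CachedExternalState wrapper, and the injected clock are hypothetical, a plain HashMap stands in for LoadingCache, and none of this is the actual DimState code or an existing Beam API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical interface: the name comes from the mail, the signature is assumed. */
interface ExternalState<K, V> {
    /** Bulk lookup: returns the values found for the given keys in one namespace. */
    Map<K, V> getAll(String namespace, List<K> keys);
}

/** Illustrative wrapper that adds a per-entry TTL cache. Not thread-safe. */
final class CachedExternalState<K, V> implements ExternalState<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ExternalState<K, V> backend;   // e.g. an HBase, JDBC, or Redis lookup
    private final long ttlMillis;
    private final LongSupplier clock;            // injected so expiry can be tested
    private final Map<String, Map<K, Entry<V>>> cache = new HashMap<>();

    CachedExternalState(ExternalState<K, V> backend, long ttlMillis, LongSupplier clock) {
        this.backend = backend;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    @Override
    public Map<K, V> getAll(String namespace, List<K> keys) {
        long now = clock.getAsLong();
        Map<K, Entry<V>> perNamespace = cache.computeIfAbsent(namespace, n -> new HashMap<>());
        Map<K, V> result = new HashMap<>();
        List<K> misses = new ArrayList<>();

        for (K key : keys) {
            Entry<V> entry = perNamespace.get(key);
            if (entry != null && entry.expiresAtMillis > now) {
                result.put(key, entry.value);    // fresh cache hit
            } else {
                perNamespace.remove(key);        // drop expired entry, if any
                misses.add(key);
            }
        }

        if (!misses.isEmpty()) {
            // One bulk round trip to the external table for all missing keys.
            Map<K, V> loaded = backend.getAll(namespace, misses);
            for (Map.Entry<K, V> e : loaded.entrySet()) {
                perNamespace.put(e.getKey(), new Entry<>(e.getValue(), now + ttlMillis));
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

In a pipeline, a wrapper like this could be called once per bundle with all product ids seen in that bundle, so the external table is hit at most once per bundle instead of once per element. Keys that are absent from the table are not cached in this sketch, so they are looked up again on every call.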