Hi all,

I've seen repeatedly the following pattern:
Consider a sql (Joining stream to table, from Calcite):
SELECT STREAM o.rowtime, o.productId, o.orderId, o.units,
  p.name, p.unitPrice
FROM Orders AS o
JOIN Products AS p
  ON o.productId = p.productId;
A stream-to-table join is straightforward if the contents of the table are not 
changing(or slowly changing). This query enriches a stream of orders with 
each product’s list price.

This table is mostly in HBase or Mysql or Redis. Most of our users think that 
we should use SideInputs to implement it. But there are some difficulties here:
1.Maybe this table is very large! AFAIK, SideInputs will load all data to 
internal. 
We can not load all, but we can do some caching work. 
2.This table may be updated periodically. As mentioned in 
https://issues.apache.org/jira/browse/BEAM-1197
3.Sometimes users want to update this table, in some scene which doesn’t 
need high accuracy. (The read and write to the external storage can’t guarantee 
Exacly-Once)

So we developed a component called DimState(Maybe the name is not right). 
Use cache(It is LoadingCache now) or load all.  They all have Time-To-Live 
mechanism. An abstract interface is called ExternalState. There are 
HBaseState, JDBCState, RedisState. It is accessed by key and namespace. 
Provides bulk access to the external table for performance.

Is there a better way to implement it? Can we make some abstracts in Beam 
Model? 

What do you think?

Best,
JingsongLee

Reply via email to