MOHIL created BEAM-10019:
----------------------------

             Summary: Keeping keys in a state for a very long time (keys expiry 
unknown)
                 Key: BEAM-10019
                 URL: https://issues.apache.org/jira/browse/BEAM-10019
             Project: Beam
          Issue Type: Improvement
          Components: website
            Reporter: MOHIL


I have a use case that I think would be a good addition to the pipeline 
patterns:

 
Beam (Java SDK) reads two kinds of records from a data stream such as Kafka:
 
1. Records of type A, containing a key and its corresponding metadata.
2. Records of type B, containing the same key but no metadata.

Beam then needs to fill in the metadata for records of type B by looking it up 
with the keys received in records of type A.
 
The idea is to save the metadata (or rather, state) for the keys received in 
records of type A, and then do a lookup when records of type B arrive.
Beam's "@State" construct can be used here. The problem, however, is that we 
don't know when the keys should expire. I don't think keeping the state in the 
global window is a good idea, because there could be many keys (maybe millions 
over a period of time) to hold in state, and they would never be garbage 
collected.
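
To make the save-then-lookup idea concrete, here is a minimal plain-Java 
sketch. It is not Beam code: the class KeyedMetadataState and its methods are 
hypothetical names, and the HashMap only stands in for per-key state that a 
stateful DoFn would hold. It also shows the concern above, since nothing ever 
removes a key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of per-key state. In a real pipeline this role would be
// played by Beam state inside a stateful DoFn, not by an in-memory map.
class KeyedMetadataState {
    private final Map<String, String> metadataByKey = new HashMap<>();

    // Record of type A: remember the metadata for this key.
    void onTypeA(String key, String metadata) {
        metadataByKey.put(key, metadata);
    }

    // Record of type B: look up metadata previously seen for this key.
    Optional<String> onTypeB(String key) {
        return Optional.ofNullable(metadataByKey.get(key));
    }

    // Number of keys held. With no expiry this only ever grows.
    int trackedKeys() {
        return metadataByKey.size();
    }
}
```

Because no key is ever evicted, trackedKeys() grows with every new key seen 
in type A records, which is the unbounded-state problem described above.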
 
One possible solution as suggested by Reza Ardeshir Rokni (raro...@gmail.com):
 
We can maintain the state in a large fixed window (1 day or so), so that GC can 
happen at the window boundary. After the window expires, save the metadata 
values in an external DB such as BigQuery. If a record with the same key arrives 
in a new window and needs this metadata, fetch the metadata for that key from 
the external DB and save it in the new window's state again.
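
The suggested flow can be sketched in plain Java as follows. This is only a 
model under stated assumptions, not a Beam implementation: 
WindowedMetadataCache is a hypothetical class, the externalStore map stands 
in for an external DB such as BigQuery, and timestamps are assumed to arrive 
in order (no late data).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of "state per fixed window, spill to an external DB".
class WindowedMetadataCache {
    private final long windowSizeMillis;
    private final Map<String, String> externalStore; // stand-in for an external DB
    private final Map<String, String> windowState = new HashMap<>();
    private long currentWindow = Long.MIN_VALUE;

    WindowedMetadataCache(long windowSizeMillis, Map<String, String> externalStore) {
        this.windowSizeMillis = windowSizeMillis;
        this.externalStore = externalStore;
    }

    // When a new window starts, persist the old window's state and clear it.
    // Assumes in-order timestamps; real pipelines must also handle late data.
    private void advanceTo(long timestampMillis) {
        long window = Math.floorDiv(timestampMillis, windowSizeMillis);
        if (window != currentWindow) {
            externalStore.putAll(windowState);
            windowState.clear();
            currentWindow = window;
        }
    }

    // Record of type A: store metadata in the current window's state.
    void onTypeA(String key, String metadata, long timestampMillis) {
        advanceTo(timestampMillis);
        windowState.put(key, metadata);
    }

    // Record of type B: use window state, else fall back to the external
    // store and reload the value into the current window's state.
    Optional<String> onTypeB(String key, long timestampMillis) {
        advanceTo(timestampMillis);
        String metadata = windowState.get(key);
        if (metadata == null) {
            metadata = externalStore.get(key);
            if (metadata != null) {
                windowState.put(key, metadata);
            }
        }
        return Optional.ofNullable(metadata);
    }

    int keysInWindowState() {
        return windowState.size();
    }
}
```

With this shape, the in-memory state is bounded by the keys seen in one 
window, at the cost of one external read the first time a key is looked up 
in each new window.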
 

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
