Hello,
I have a Dataflow streaming pipeline where I need to consume queue messages (PubsubIO) and join each one against a very large lookup table from a database to pick up relevant values. I have tried the following methods:

1) Side input

In this approach I loaded the large lookup table as a side input (a static map object) and looked each message up in the map on every machine. This is obviously not ideal for a streaming process, but it did work with smaller test datasets. With larger, production-sized datasets, however, the side input consistently failed with the error:

    java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: ByteString would be too long: 590128395+1811498110
        at com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
        at com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:104)
        ...

which implies a hard limit on the size of side inputs. So I have been working on the following strategy instead:

2) Key-value pair grouping

This is the main problem I have been dealing with. In this approach I convert both the large lookup table and the incoming queue messages to key-value pairs, then combine the messages with the relevant lookup data using a Flatten merge transform followed by a GroupByKey transform. Although this runs without error, it never outputs any data. The core of the problem is that streaming messages require a windowing strategy before they can be merged with other data, but windowing does not make sense for the large lookup dataset - it should all be in one window. All of the lookup data needs to be available at once, and it needs to be joined against every small batch of Pub/Sub messages streaming in, over and over again.
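For reference, stripped of the Beam plumbing, the per-message work in the side-input approach amounts to a plain map lookup. A minimal sketch of that join logic (class, method, and key names here are illustrative, not from my actual pipeline):

```java
import java.util.HashMap;
import java.util.Map;

public class SideInputJoin {
    // Simulates the DoFn body: enrich one message using the in-memory lookup map.
    // In the real pipeline the map would arrive as a side input (e.g. View.asMap()).
    static String enrich(String message, String key, Map<String, String> lookup) {
        // Messages whose key is missing from the table pass through unenriched.
        return message + "|" + lookup.getOrDefault(key, "UNKNOWN");
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("user42", "premium");

        System.out.println(enrich("evt1", "user42", lookup)); // evt1|premium
        System.out.println(enrich("evt2", "user99", lookup)); // evt2|UNKNOWN
    }
}
```

The lookup itself is trivial; the failure above happens before any of this runs, while the materialized side input is being shipped to the workers.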
I attempted to window the map dataset with a static timestamp and then to create a window large enough to cover all of its items - but of course, this just sends all the map data through the stream once, in a single window, and then never again. I am facing a situation where I need to repeatedly join a continuous stream against a large amount of static, bounded data - is the only solution to run many batch jobs?
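To clarify what I am after: ignoring windowing entirely, the Flatten + GroupByKey approach is meant to produce an ordinary key-based join. In plain Java the intended result looks like this (names are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class KeyedMerge {
    // Simulates Flatten followed by GroupByKey: both the lookup rows and the
    // streamed messages become (key, value) pairs, then values are collected
    // per key. TreeMap is used only to make the output order deterministic.
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> flattened) {
        return flattened.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> flattened = List.of(
                Map.entry("user42", "lookup:premium"), // from the lookup table
                Map.entry("user42", "msg:evt1"),       // from Pub/Sub
                Map.entry("user7",  "msg:evt2"));      // no matching lookup row

        System.out.println(groupByKey(flattened));
    }
}
```

In the streaming pipeline, though, a key's lookup row and its messages only meet if they land in the same window - and once the bounded lookup data has been emitted in its single window, that never happens again.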
