Hello,

I have a Dataflow streaming pipeline where I need to consume queue messages 
(PubsubIO) and enrich each one by joining it against a very large lookup table 
loaded from a database. I have tried the following methods:


1) Side Input

In this approach I loaded the large lookup table as a side input (a static map 
object) and looked each message up in the map on every worker. Obviously this 
is not ideal for a streaming process, but it did work with smaller test 
datasets. However, with larger, production-sized datasets, the side input 
repeatedly failed with the error:


  exception: "java.lang.RuntimeException: 
  org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: 
  ByteString would be too long: 590128395+1811498110
      at com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
      at com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:104)

... which implies a direct limit on the size of side inputs (the two lengths in 
the message sum to roughly 2.4 GB, past the 2 GB maximum of a protobuf 
ByteString). So I've been working on the following strategy instead:
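For reference, the side-input wiring looked roughly like this (a sketch only; names like lookupRows, messages, and the string-typed values are placeholders, and the Beam SDK imports are elided):

```java
// Approach 1 sketch: materialize the lookup table as a map-valued side input.
PCollectionView<Map<String, String>> lookupView =
    lookupRows                                  // PCollection<KV<String, String>> read from the database
        .apply(View.<String, String>asMap());   // broadcast to every worker as a side input

PCollection<String> enriched =
    messages                                    // PCollection<String> from PubsubIO
        .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            Map<String, String> lookup = c.sideInput(lookupView);
            String value = lookup.get(c.element());   // join each message against the map
            if (value != null) {
              c.output(c.element() + "," + value);
            }
          }
        }).withSideInputs(lookupView));
```

It is this materialized map that blows past the size limit once the table reaches production size.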


2) Key-value pair grouping

This is the main problem I've been dealing with. In this approach I attempted 
to convert both the large map table and the incoming queue messages to 
key-value pairs, then combine the messages with the relevant map data using a 
Flatten merge and a GroupByKey transform. Although this runs without error, it 
never outputs any data.
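Roughly what that wiring looks like (again a sketch assuming the Beam Java SDK; keyedLookup, extractKey, and the string types are placeholders):

```java
// Approach 2 sketch: key both collections, Flatten them into one
// PCollection, then GroupByKey so each key carries both the lookup
// row(s) and the message(s).
PCollection<KV<String, String>> keyedMessages =
    messages.apply(
        MapElements.into(
                TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
            .via(m -> KV.of(extractKey(m), m)));

PCollection<KV<String, String>> unioned =
    PCollectionList.of(keyedMessages)
        .and(keyedLookup)                       // the large lookup table as KV pairs
        .apply(Flatten.pCollections());

PCollection<KV<String, Iterable<String>>> grouped =
    unioned.apply(GroupByKey.<String, String>create());
```

It is the GroupByKey at the end that forces the windowing question below, since grouping an unbounded source requires non-global windowing or triggers.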


The core of the problem is here: streaming messages require a windowing 
strategy if they are to be merged with other data. However, windowing logic 
does not really make sense for the large map dataset - it should all be one 
window. All of the data needs to be available at once, and it needs to be 
joined against every small batch of Pubsub messages streaming in (over and 
over again). I attempted to window the map dataset with a static timestamp and 
a window large enough to cover all items - but of course, this just sends all 
the map data through the stream once, in a single window, and then never 
again.
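Concretely, the static-timestamp attempt looked something like this (a sketch; keyedLookup and the window length are placeholders):

```java
// Stamp every lookup row with one fixed timestamp and put everything
// into a single very large window. As described above, this emits the
// whole map into the stream exactly once, and then never again.
PCollection<KV<String, String>> windowedLookup =
    keyedLookup
        .apply(WithTimestamps.of(kv -> new Instant(0L)))   // one static timestamp for all rows
        .apply(Window.<KV<String, String>>into(
            FixedWindows.of(Duration.standardDays(36500))));
```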


I am facing a situation where I need to repeatedly join a continuous stream 
against a large amount of static, bounded data - is running many batch jobs 
the only solution?
