Hello,
I have a Dataflow streaming pipeline where I need to consume queue messages (PubsubIO) and join each one against a very large lookup table from a database to pick up relevant values. I have tried the following methods:

1) Side input

In this approach I loaded the large lookup table as a side input (a static map object) and looked each message up in the map on every machine. This is obviously not ideal for a streaming process, but it did work with smaller test datasets. With larger, production-sized datasets, however, the side input consistently failed with the error:

    java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: ByteString would be too long: 590128395+1811498110
        at com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
        at com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:104)
        ...

which implies a hard limit on the size of side inputs. So I have been working on the following strategy instead:

2) Key-value pair grouping

This is the main problem I have been dealing with. In this approach I convert both the large lookup table and the incoming queue messages to key-value pairs, then combine the messages with the relevant lookup data using a Flatten merge transform followed by a GroupByKey transform. Although this runs without error, it never outputs any data. The core of the problem is that streaming messages require a windowing strategy before they can be merged with other data, but windowing does not make sense for the large lookup dataset - it should all be in one window. All of the lookup data needs to be available at once, and it needs to be joined against every small batch of Pub/Sub messages streaming in, over and over again.
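For reference, stripped of the Beam plumbing, the per-message work in the side-input approach amounts to a plain map lookup. A minimal sketch of that join logic (class, method, and key names here are illustrative, not from my actual pipeline):

```java
import java.util.HashMap;
import java.util.Map;

public class SideInputJoin {
    // Simulates the DoFn body: enrich one message using the in-memory lookup map.
    // In the real pipeline the map would arrive as a side input (e.g. View.asMap()).
    static String enrich(String message, String key, Map<String, String> lookup) {
        // Messages whose key is missing from the table pass through unenriched.
        return message + "|" + lookup.getOrDefault(key, "UNKNOWN");
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("user42", "premium");

        System.out.println(enrich("evt1", "user42", lookup)); // evt1|premium
        System.out.println(enrich("evt2", "user99", lookup)); // evt2|UNKNOWN
    }
}
```

The lookup itself is trivial; the failure above happens before any of this runs, while the materialized side input is being shipped to the workers.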
I attempted to window the map dataset with a static timestamp and then to create a window large enough to cover all of its items - but of course, this just sends all the map data through the stream once, in a single window, and then never again. I am facing a situation where I need to repeatedly join a continuous stream against a large amount of static, bounded data - is the only solution to run many batch jobs?
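To clarify what I am after: ignoring windowing entirely, the Flatten + GroupByKey approach is meant to produce an ordinary key-based join. In plain Java the intended result looks like this (names are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class KeyedMerge {
    // Simulates Flatten followed by GroupByKey: both the lookup rows and the
    // streamed messages become (key, value) pairs, then values are collected
    // per key. TreeMap is used only to make the output order deterministic.
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> flattened) {
        return flattened.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> flattened = List.of(
                Map.entry("user42", "lookup:premium"), // from the lookup table
                Map.entry("user42", "msg:evt1"),       // from Pub/Sub
                Map.entry("user7",  "msg:evt2"));      // no matching lookup row

        System.out.println(groupByKey(flattened));
    }
}
```

In the streaming pipeline, though, a key's lookup row and its messages only meet if they land in the same window - and once the bounded lookup data has been emitted in its single window, that never happens again.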
