Hello,
Did you try specifying the cache size for side inputs?
--workerCacheMB=...
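For example, when launching the pipeline (the jar name and the 2048 value are illustrative, not from this thread; if I remember correctly the Dataflow default worker cache is around 100 MB):

```shell
# Launch with a larger per-worker side-input cache.
# "my-pipeline-bundled.jar" and 2048 are illustrative placeholders.
java -jar my-pipeline-bundled.jar \
  --runner=DataflowRunner \
  --workerCacheMB=2048
```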

For merging data by key, you can also try a stateful ParDo with a global
window. I have also used Combine and CoGroupByKey before the stateful ParDo.
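A minimal sketch of the stateful idea (this is a sketch under assumptions, not your code: the lookup rows and the Pubsub messages are assumed to be flattened into one KV<String, String> stream, with lookup rows tagged by a "lookup:" prefix):

```java
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Stateful DoFn in the global window: caches lookup rows per key for the
// lifetime of the pipeline and joins each message against the cached rows.
public class JoinFn extends DoFn<KV<String, String>, KV<String, String>> {

  // Per-key, per-window bag of lookup values; in the global window this
  // effectively persists for the whole pipeline.
  @StateId("lookupRows")
  private final StateSpec<BagState<String>> lookupRowsSpec = StateSpecs.bag();

  @ProcessElement
  public void process(ProcessContext c,
      @StateId("lookupRows") BagState<String> lookupRows) {
    String value = c.element().getValue();
    if (value.startsWith("lookup:")) {
      // A lookup-table row: remember it for this key.
      lookupRows.add(value.substring("lookup:".length()));
    } else {
      // A streaming message: emit one joined output per cached row.
      for (String row : lookupRows.read()) {
        c.output(KV.of(c.element().getKey(), value + "|" + row));
      }
    }
  }
}
```

One caveat: a message that arrives before its lookup rows finds an empty bag, so in practice you may also want to buffer unmatched messages in a second state cell.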

Best regards
Aleksandr Gortujev

On 22 Nov 2017 at 10:10 PM, "Taylor Coleman" <
[email protected]> wrote:

Hello,


I have a Dataflow streaming pipeline in which I need to consume queue
messages (PubsubIO) and join each one against a very large lookup table
loaded from a database. I have tried the following methods:


1) Side Input

In this approach I kept the large lookup table as a side input (a static map
object) and looked each message up in the map on each machine. Obviously
this is not ideal for a streaming process, but it did work with smaller
test datasets. With larger, production-sized datasets, however, the side
input consistently failed with the error:


  exception: "java.lang.RuntimeException:
  org.apache.beam.sdk.util.UserCodeException:
  java.lang.IllegalArgumentException: ByteString would be too long: 590128395+1811498110
      at com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
      at com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:104)

... which suggests a hard limit on the size of a side input (the two
lengths in the message sum past the 2 GB ByteString maximum). So I've been
working on the following strategy instead:


2) Key-value pair grouping

This is the main problem I've been dealing with. In this approach I
converted both the large map table and the incoming queue messages to
key-value pairs, then tried to combine the messages with the relevant map
data using a Flatten merge transform and a GroupByKey transform.
Although this runs without error, it never outputs any data.


The core of the problem is here: streaming messages require a windowing
strategy if they are to be merged with other data. However, windowing does
not make sense for the large map dataset - it should all be one window. All
of the data needs to be available at once, and it needs to be joined with
every small batch of Pubsub messages streaming in (over and over again). I
attempted to window the map dataset with a static timestamp and to create a
window large enough to cover all of its items - but of course, this just
sends all the map data through the stream once, in a single window, and
then never again.
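The mismatch I'm describing shows up even in a toy bounded pipeline (names and timestamps below are illustrative, not my real code): a lookup row stamped at epoch and a message stamped 100 seconds later land in different 30-second windows, and since GroupByKey only groups within a window, the two are never joined.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TimestampedValue;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class WindowMismatchDemo {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Bounded "lookup" rows, all stamped once at epoch.
    PCollection<KV<String, String>> lookup = p
        .apply("Lookup", Create.timestamped(
            TimestampedValue.of(KV.of("k", "row1"), new Instant(0L))))
        .apply("WinLookup", Window.<KV<String, String>>into(
            FixedWindows.of(Duration.standardSeconds(30))));

    // A "streaming" message arriving 100 s later.
    PCollection<KV<String, String>> messages = p
        .apply("Messages", Create.timestamped(
            TimestampedValue.of(KV.of("k", "msg1"), new Instant(100_000L))))
        .apply("WinMessages", Window.<KV<String, String>>into(
            FixedWindows.of(Duration.standardSeconds(30))));

    PCollection<KV<String, Iterable<String>>> grouped =
        PCollectionList.of(lookup).and(messages)
            .apply(Flatten.pCollections())
            .apply(GroupByKey.create());

    // Each group holds only one element: the row and the message fall
    // into different windows, so they never meet under the same key.
    PCollection<String> sizes = grouped.apply(MapElements
        .into(TypeDescriptors.strings())
        .via((KV<String, Iterable<String>> kv) -> {
          int n = 0;
          for (String ignored : kv.getValue()) n++;
          return kv.getKey() + ":" + n;
        }));
    PAssert.that(sizes).containsInAnyOrder("k:1", "k:1");
    p.run().waitUntilFinish();
  }
}
```

In the streaming pipeline the symptom is the same: the bounded map data fires once in its own window and never reappears against later message windows.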


I am facing a situation where I need to repeatedly join a continuous
stream against a large amount of static, bounded data - is the only
solution to run many batch jobs?
