Hello everyone,
We are encountering a weird issue while running a Python + Beam streaming job on
Google Cloud Dataflow.
The job consumes events from a Pub/Sub subscription, and the pipeline looks
like this:
> import json
>
> import apache_beam as beam
>
> messages = (
>     p
>     | "Read Topic" >> beam.io.ReadFromPubSub(
>         subscription=options.subscription.get())
>     | "JSON" >> beam.Map(json.loads)
> )
>
> sessions = (
>     messages
>     | "Add Keys" >> beam.WithKeys(lambda x: x["id"])
>     | "Session Window" >> beam.WindowInto(
>         beam.window.Sessions(SESSION_TIMEOUT))
>     | beam.GroupByKey()
>     | "Analyze Session" >> beam.ParDo(AnalyzeSession())
> )
>
> sessions | beam.io.WriteToPubSub(topic=options.session_topic.get())
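AnalyzeSession indexes into the grouped values of each session, roughly like
this (a trimmed-down sketch; the field names and the analysis itself are
illustrative, not our exact code):

> class AnalyzeSession(beam.DoFn):
>     """Simplified sketch: the real DoFn does more work, but the access
>     pattern is the same. The 'ts' field is illustrative."""
>     def process(self, element):
>         session_id, events = element  # events: grouped values for one session
>         first = events[0]   # subscripting the grouped values
>         last = events[-1]
>         summary = {"id": session_id, "duration": last["ts"] - first["ts"]}
>         # WriteToPubSub expects bytes
>         yield json.dumps(summary).encode("utf-8")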
After the job runs for some time without any issues, it suddenly starts
failing with the following error:

> TypeError: '_ConcatSequence' object is not subscriptable
Instead of the expected key/value pair, which normally looks like:

> ('ID123', [{...}, {...}, {...}])

I now get:

> ('ID234', <apache_beam.coders.coder_impl._ConcatSequence object at 0x7feca40d1d90>)
I suspect this happens under heavy load, but I could not find any information
on why it happens or how to mitigate it.
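The only workaround we can think of is to materialize the iterable before
indexing, roughly as sketched below, but I am not sure whether that is safe
memory-wise for the large sessions that seem to trigger this in the first
place:

> class AnalyzeSession(beam.DoFn):
>     def process(self, element):
>         session_id, events = element
>         # Forcing the lazy iterable into a list restores subscripting,
>         # but pulls the entire session into worker memory at once.
>         events = list(events)
>         ...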
Any help would be much appreciated!
Thanks.