Hi Eddy,
does your data get buffered in a state - e.g. does the size of the state
grow over time? Do you see watermark being updated in your Flink WebUI?
When a stateful operation (and GroupByKey is a stateful operation) does
not output any data, the first place to look at is if watermark
correctly progresses. If it does not progress, then the input data must
be buffered in state and the size of the state should grow over time. If
it progresses, then it might be the case, that the data is too late
after the watermark (the watermark estimator might need tuning) and the
data gets dropped (note you don't set any allowed lateness, which
_might_ cause issues). You could see if your pipeline drops data in
"droppedDueToLateness" metric. The size of you state would not grow much
in that situation.
Another hint - If you use KafkaIO, try to disable SDF wrapper for it
using "--experiments=use_deprecated_read" on command line (which you
then must pass to PipelineOptionsFactory). There is some suspicion that
SDF wrapper for Kafka might not work as expected in certain situations
with Flink.
Please feel free to share any results,
Jan
On 6/14/21 1:39 PM, Eddy G wrote:
As seen in this image https://imgur.com/a/wrZET97, I'm trying to deal with late
data (intentionally stopped my consumer so data has been accumulating for
several days now). Now, with the following Window... I'm using Beam 2.27 and
Flink 1.12.
Window.into(FixedWindows.of(Duration.standardMinutes(10)))
And several parsing stages after, once it's time to write within the ParquetIO
stage...
FileIO
.<String, MyClass>writeDynamic()
.by(...)
.via(...)
.to(...)
.withNaming(...)
.withDestinationCoder(StringUtf8Coder.of())
.withNumShards(options.getNumShards())
it won't send bytes across all stages so no data is being written, still it
accumulates in the first stage seen in the image and won't go further than that.
Any reason why this may be happening? Wrong windowing strategy?