Hi Roman,

Thanks for the reply. Here is a screenshot of the latest failed
checkpoint.

[image: Screenshot 2022-03-09 at 11.44.46 AM.png]

I couldn't find the details for the last successful one, as we only retain
details for the last 10 checkpoints. Also, could you give us an idea of
exactly what details you are looking for?

Regarding the operators: the source operators for both input streams look
fine, but the interval join operator does not appear to be clearing the
state it holds. The relevant node from the plan:

IntervalJoin(joinType=[LeftOuterJoin], windowBounds=[isRowTime=true,
leftLowerBound=-5400000, leftUpperBound=0, leftTimeIndex=8,
rightTimeIndex=5], where=[((price_calculation_id = $f92) AND (rowtime0 >=
rowtime) AND (rowtime0 <= (rowtime + 5400000:INTERVAL MINUTE)))],
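
For reference, the leftLowerBound of -5400000 ms matches the 90-minute
interval in the query quoted below, and as far as we understand, the
interval join can only clear state once watermarks advance past the join
window. To check whether watermarks are advancing at all, a probe like the
following should help (a minimal sketch; table1 and rowtime are the names
from our query, and the alias wm is just for illustration):

SELECT
  rowtime,
  -- NULL here means no watermark has reached this operation yet
  CURRENT_WATERMARK(rowtime) AS wm
FROM table1;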

We currently suspect the way we are generating watermarks for the new
Kafka source; they might be causing the problem, as CURRENT_WATERMARK(rowtime)
returns NULL from the SQL query.
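
For completeness, this is roughly how a watermark can be declared on a
Kafka source table in the DDL (a simplified sketch only; the table name,
columns, topic, connector options, and the 5-second delay are placeholders,
not our actual configuration):

CREATE TABLE table2 (
  field4 STRING,
  field5 STRING,
  creation_time TIMESTAMP(3),
  rowtime TIMESTAMP(3),
  -- watermarks generated from this column drive state cleanup in the interval join
  WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'placeholder_topic',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

Also worth noting: with the new KafkaSource, watermarks are generated per
Kafka partition, so a single idle partition can hold the overall watermark
back; setting table.exec.source.idle-timeout should mitigate that (we still
need to verify this on our side).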

Thanks
Prakhar

On Tue, Mar 8, 2022 at 5:49 PM Roman Khachatryan <ro...@apache.org> wrote:

> Hi Prakhar,
>
> Could you please share the statistics about the last successful and
> failed checkpoints, e.g. from the UI?
> Ideally with a detailed breakdown for the operators that seem problematic.
>
> Regards,
> Roman
>
> On Fri, Mar 4, 2022 at 8:48 AM Prakhar Mathur <prakha...@gojek.com> wrote:
> >
> > Hi,
> >
> > Could someone kindly take a look at this? It's a major blocker
> > for us.
> >
> > Thanks,
> > Prakhar
> >
> > On Wed, Mar 2, 2022 at 2:11 PM Prakhar Mathur <prakha...@gojek.com> wrote:
> >>
> >> Hello,
> >>
> >> We recently migrated our Flink jobs from version 1.9.0 to 1.14.3. These
> >> jobs consume from Kafka and produce to their respective sinks. We use
> >> MemoryStateBackend for checkpointing, with GCS as the remote filesystem.
> >> After the migration, we found that a few jobs with a LEFT JOIN in their
> >> SQL query started failing: their checkpoint size kept increasing. We
> >> haven't changed the SQL queries. The following is one of the queries
> >> affected by this issue:
> >>
> >> SELECT
> >>   table1.field1,
> >>   table2.field2,
> >>   table2.field3,
> >>   table1.rowtime AS estimate_timestamp,
> >>   table2.creation_time AS created_timestamp,
> >>   CAST(table2.rowtime AS TIMESTAMP)
> >> FROM table1
> >> LEFT JOIN table2
> >>   ON table1.field1 = COALESCE(NULLIF(table2.field4, ''), table2.field5)
> >>   AND table2.rowtime BETWEEN table1.rowtime
> >>       AND table1.rowtime + INTERVAL '90' MINUTE
> >> WHERE table2.field6 IS NOT TRUE
> >>
> >> A few data points:
> >>
> >> - On version 1.9.0 the job ran with a parallelism of 20; now it cannot
> >>   run even with a parallelism of 40.
> >> - On version 1.9.0 the max checkpoint size reached about 3.5 GB during
> >>   peak hours. On 1.14.3 it just keeps increasing, reaches 30 GB, and the
> >>   job eventually fails due to lack of resources.
> >> - In version 1.9.0 we were using
> >>   org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer; in
> >>   1.14.3 we have moved to the new KafkaSource.
> >>
> >> Any help will be highly appreciated as these are production jobs.
> >>
> >> Thanks
> >> Prakhar Mathur
>
