Hi Eleanore,

I have a few more questions regarding your issue.

   - Which Flink version are you using?
   - Is this skew observed only after scaling up? What happens if the
   parallelism is initially set to the scaled-up value (see the sketch right
   after this list)?
   - If you keep the job running for a while after the scale-up, does the
   skew ease?
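
To illustrate the second question, here is a minimal sketch of reading the
parallelism from env variables and applying it at job startup. The variable
names and the placeholder pipeline are assumptions of mine, not your actual
setup:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismFromEnv {

    public static void main(String[] args) throws Exception {
        // Hypothetical variable names; substitute whatever your operator injects.
        int parallelism = Integer.parseInt(
                System.getenv().getOrDefault("JOB_PARALLELISM", "2"));
        int maxParallelism = Integer.parseInt(
                System.getenv().getOrDefault("JOB_MAX_PARALLELISM", "128"));

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);       // default parallelism for all operators
        env.setMaxParallelism(maxParallelism); // upper bound for key-group based rescaling

        // Placeholder pipeline; your Kafka source and sink would go here instead.
        env.fromElements(1, 2, 3)
           .map(i -> i * 2)
           .print();

        env.execute("parallelism-from-env-example");
    }
}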

I suspect the performance difference might be a warm-up effect. E.g., the
existing TMs might already have some files localized, or some memory
buffers already promoted to the JVM tenured generation, while the new TMs
have not.
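
If it would help to check whether the skew eases, one rough way to watch it
over time is to poll the min/max/avg of numRecordsOutPerSecond across the
subtasks of a vertex and see whether the gap shrinks. This is just a
sketch: the host, job id and vertex id are placeholders, and I'm assuming
the aggregated subtask metrics REST endpoint here.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SubtaskSkewProbe {

    public static void main(String[] args) throws Exception {
        // Placeholders: point these at your JobManager REST endpoint, job id
        // and the vertex id of the operator you want to inspect.
        String restBase = "http://localhost:8081";
        String jobId = "<job-id>";
        String vertexId = "<vertex-id>";

        // Aggregated subtask metrics: min/max/avg of numRecordsOutPerSecond
        // across all subtasks of the vertex. A max/avg gap that does not
        // shrink over time would indicate persistent skew.
        String url = restBase + "/jobs/" + jobId + "/vertices/" + vertexId
                + "/subtasks/metrics?get=numRecordsOutPerSecond&agg=min,max,avg";

        for (int i = 0; i < 10; i++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                System.out.println(in.readLine());
            }
            Thread.sleep(60_000); // sample roughly once a minute
        }
    }
}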

Thank you~

Xintong Song



On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin <eleanore....@gmail.com> wrote:

> Hi Experts,
> I have my Flink application running on Kubernetes, initially with 1 Job
> Manager and 2 Task Managers.
>
> Then we have a custom operator that watches the CRD. When the CRD's
> replicas change, it patches the Flink Job Manager deployment's parallelism
> and max parallelism according to the replicas from the CRD (parallelism
> can be configured via env variables for our application). This causes the
> Job Manager to restart and hence a new Flink job, but the consumer group
> does not change, so it continues from the offsets where it left off.
>
> In addition, the operator also updates the Task Manager deployment's
> replicas, which adjusts the number of pods.
>
> In the case of a scale-up, the existing Task Manager pods do not get
> killed, but new Task Manager pods are created.
>
> We observed a skew in the partition offsets consumed, e.g. some partitions
> have huge lags while other partitions have small lags (observed from
> Burrow).
>
> This is also validated by the metrics from the Flink UI, which show that
> the throughput differs across slots.
>
> Any clue why this is the case?
>
> Thanks a lot!
> Eleanore
>
