Re: Re: Problems with Functions/IO in Upgrading Pulsar from 2.7 to 2.8

Neng Lu Mon, 19 Jul 2021 15:55:02 -0700

Based on my local test, functions using the String schema are fine.
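
For anyone who wants to check which side of the change a given classpath is on, here is a minimal sketch (not from the ticket, and not an official Pulsar tool). It relies only on what this thread states: SchemaInfo was a class in 2.7 and is an interface in 2.8. The names SchemaInfoKindCheck and describe are hypothetical helpers made up for illustration; org.apache.pulsar.common.schema.SchemaInfo is the type under discussion. Run without any Pulsar jar on the classpath, it just reports "not found".

```java
// Hypothetical helper: reports whether a type on the current classpath
// is a class or an interface, without initializing it.
final class SchemaInfoKindCheck {

    // Returns "interface", "class", or "not found" for the given type name.
    static String describe(String className) {
        try {
            // initialize=false: only load the type, do not run static initializers.
            Class<?> type = Class.forName(
                    className, false, SchemaInfoKindCheck.class.getClassLoader());
            return type.isInterface() ? "interface" : "class";
        } catch (ClassNotFoundException e) {
            return "not found";
        }
    }

    public static void main(String[] args) {
        // Per this thread: a class in Pulsar 2.7.x, an interface in 2.8.0.
        String name = "org.apache.pulsar.common.schema.SchemaInfo";
        System.out.println(name + " -> " + describe(name));
    }
}
```

Running this once against the client jar a function was built with and once against the worker's jar should show whether the two disagree. Code compiled against one shape and executed against the other is the kind of mismatch that the JVM reports as IncompatibleClassChangeError.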


On 2021/07/19 18:47:49 Devin Bost wrote:
> > This leads to an IncompatibleClassChangeError when you have a Function or
> > a Connector that is using Schema.JSON(Pojo.class)
> 
> I just noticed this detail. Do we have a sense of how often people are
> using Schema.JSON in Functions/Connectors?
> Most of our functions are using a string schema, so it's not clear to me if
> they would be impacted.
> 
> Devin G. Bost
> 
> 
> On Mon, Jul 19, 2021 at 12:41 PM Devin Bost <[email protected]> wrote:
> 
> > > I think Sijie is referring to using KubernetesRuntime to deploy functions
> > > where each function/source/sink runs as an independent statefulset in K8s.
> > > In this scenario, it is possible to have fine grained control over which
> > > version of the function container the function is using.
> >
> > Not everybody is using the KubernetesRuntime yet (especially since the
> > Helm charts aren't feature-complete), and it appears that those who aren't
> > running KubernetesRuntime would be impacted the most by this issue.
> >
> > Devin G. Bost
> >
> >
> > On Mon, Jul 19, 2021 at 12:36 PM Devin Bost <[email protected]> wrote:
> >
> >> > For example, if you are upgrading Flink from one version to the other
> >> > version, you have to make a save point in the previous version for all
> >> > the Flink jobs.
> >> > Upgrade the Flink cluster and resume jobs in a new version.
> >> >
> >> >
> >> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/
> >> >
> >> > So it is not unreasonable to ask people to do that when dealing
> >> > with upgrading a centralized computing engine.
> >>
> >> One difference with Flink is that organizations running Flink in job mode
> >> or application mode can upgrade jobs independently of one another, so teams
> >> can upgrade jobs when they are ready without impacting other teams. In the
> >> Pulsar case, Pulsar is multi-tenant, so upgrading the entire cluster would
> >> break every tenant simultaneously and would block the flow of all messages
> >> until all functions are upgraded. If one team takes a year to upgrade their
> >> one function, the cluster could not be upgraded until that happened. Also,
> >> after all the functions have been upgraded, there would be production
> >> downtime while deploying all the upgraded functions, which would be a major
> >> outage... It might be possible to write a script to speed up the deployment
> >> to shrink the outage window, but there's currently a bug that wipes out
> >> existing userConfigs when a function is upgraded, so that adds to the
> >> complexity of upgrading all the functions since someone would need to know
> >> all the userConfigs for all the functions.
> >>
> >> So, I don't think we're really comparing the same things here.
> >>
> >> Devin G. Bost
> >>
> >>
> >> On Mon, Jul 19, 2021 at 12:17 PM Sijie Guo <[email protected]> wrote:
> >>
> >>> On Mon, Jul 19, 2021 at 10:32 AM Jerry Peng <[email protected]>
> >>> wrote:
> >>> >
> >>> > I agree that the best we can do right now is to just clearly document this
> >>> > as a potential problem when updating 2.7 to 2.8.
> >>> >
> >>> > We should definitely make every attempt to not make BC breaking changes.
> >>> > However, there are times when we have to make these tough decisions for
> >>> > one reason or another. The bigger problem I see here is not necessarily
> >>> > that a BC breaking change occurred, but rather that we didn't know about
> >>> > it beforehand and so could not clearly document this caveat when 2.8 was
> >>> > released. Perhaps this is where we can improve our backwards
> >>> > compatibility testing. We already have some, but probably not enough, as
> >>> > highlighted by this case.
> >>> >
> >>> > In regards to
> >>> >
> >>> > > This is partially correct, because you can wait to upgrade the workers
> >>> > > pod, but there is no fine grained control over which version of each pod
> >>> > > will be running your function, especially in a big cluster with many
> >>> > > tenants and functions with this problem
> >>> > >
> >>> >
> >>> >
> >>> > I think Sijie is referring to using KubernetesRuntime to deploy functions
> >>> > where each function/source/sink runs as an independent statefulset in K8s.
> >>> > In this scenario, it is possible to have fine grained control over which
> >>> > version of the function container the function is using. There currently
> >>> > might not be tools to easily allow users to do this, but using kubectl one
> >>> > can definitely determine which container version is running and
> >>> > potentially update the container version on a per function basis.
> >>>
> >>> Jerry - Thank you! That was what I meant.
> >>>
> >>> >
> >>> > Best,
> >>> >
> >>> > Jerry
> >>> >
> >>> > On Mon, Jul 19, 2021 at 12:50 AM Enrico Olivelli <[email protected]>
> >>> > wrote:
> >>> >
> >>> > > Sijie,
> >>> > > Thank you for your feedback
> >>> > > Some additional considerations inline
> >>> > >
> >>> > > Il Lun 19 Lug 2021, 06:47 Sijie Guo <[email protected]> ha scritto:
> >>> > >
> >>> > > > I don't think this is a big problem, because people can recompile the
> >>> > > > function and resubmit it. Most of the computing/streaming engines ask
> >>> > > > users to recompile and resubmit their jobs when upgrading to a new
> >>> > > > version.
> >>> > >
> >>> > >
> >>> > > Unfortunately this is not easily feasible if the org that manages the
> >>> > > Pulsar service is different from the org that develops the Functions.
> >>> > > In particular, it is practically impossible to prevent service
> >>> > > interruption.
> >>> > >
> >>> > > BTW I believe that there is no way to fix this at this point.
> >>> > >
> >>> > > > The best approach here is to document this
> >>> > > > behavior.
> >>> > > >
> >>> > >
> >>> > > I agree that the best thing we can do is to document this requirement.
> >>> > >
> >>> > > Therefore we must ensure that we do not run into this kind of issue
> >>> > > again in the future.
> >>> > >
> >>> > > Pulsar is increasingly used by large enterprises, and backward
> >>> > > compatibility is of great value.
> >>> > >
> >>> > > Fortunately not all the Functions need rebuilding.
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > > Also, if you are using Kubernetes runtime to schedule functions, you
> >>> > > > are not really impacted.
> >>> > > >
> >>> > >
> >>> > > This is partially correct, because you can wait to upgrade the workers
> >>> > > pod, but there is no fine grained control over which version of each pod
> >>> > > will be running your function, especially in a big cluster with many
> >>> > > tenants and functions with this problem
> >>> > >
> >>> > >
> >>> > > Enrico
> >>> > >
> >>> > >
> >>> > > > - Sijie
> >>> > > >
> >>> > > > On Fri, Jul 16, 2021 at 2:44 AM Enrico Olivelli <
> >>> [email protected]>
> >>> > > > wrote:
> >>> > > > >
> >>> > > > > Hello,
> >>> > > > > I have reported this issue [1] about upgrading from Pulsar 2.7 to 2.8.
> >>> > > > > More information is on the ticket, but the short version of the story
> >>> > > > > is that in Pulsar 2.8 we introduced a breaking change in the Schema
> >>> > > > > API, by switching SchemaInfo from a class to an interface.
> >>> > > > >
> >>> > > > > This leads to an IncompatibleClassChangeError when you have a Function
> >>> > > > > or a Connector that is using Schema.JSON(Pojo.class) and you upgrade
> >>> > > > > your Pulsar cluster (the functions worker pod for instance) from Pulsar
> >>> > > > > 2.7.x to Pulsar 2.8.0.
> >>> > > > >
> >>> > > > > The real problem is that you cannot upgrade Pulsar without interrupting
> >>> > > > > the service and coordinating with the upgrade of the Functions.
> >>> > > > > Your functions need to be recompiled against the Pulsar 2.8 API and
> >>> > > > > deployed again in production.
> >>> > > > >
> >>> > > > > I have tried to move SchemaInfo back to an "abstract class", but
> >>> > > > > without success, because you then run into errors.
> >>> > > > >
> >>> > > > > I am not sure there is a way to provide a good "upgrade path" for
> >>> > > > > Functions/IO users.
> >>> > > > >
> >>> > > > > If we do not find a way, we have to document the upgrade in the
> >>> > > > > official Pulsar Documentation.
> >>> > > > >
> >>> > > > > We must do our best to prevent users from falling into this bad
> >>> > > > > situation again.
> >>> > > > >
> >>> > > > > Any suggestions or thoughts ?
> >>> > > > >
> >>> > > > > Regards
> >>> > > > > Enrico
> >>> > > > >
> >>> > > > > [1] https://github.com/apache/pulsar/issues/11338
> >>> > > >
> >>> > >
> >>>
> >>
>
