Hi Xiaogang,

I think this proposal doesn't conflict with your use case, you can still
chain a ProcessFunction after a source which emits raw data.
But I'm not in favor of chaining ProcessFunction after source, and we
should avoid that, because:

1) For correctness, it is necessary to perform the watermark generation as
early as possible in order to be close to the actual data
 generation within a source's data partition. This is also the purpose of
per-partition watermark and event-time alignment.
 Many on going FLIPs (e.g. FLIP-27, FLIP-95) works a lot on this effort.
Deseriazing records and generating watermark in chained
 ProcessFunction makes it difficult to do per-partition watermark in the
future.
2) In Flink SQL, a source should emit the deserialized row instead of raw
data. Otherwise, users have to define raw byte[] as the
 single column of the defined table, and parse them in queries, which is
very inconvenient.

Best,
Jark

On Fri, 10 Apr 2020 at 09:18, SHI Xiaogang <shixiaoga...@gmail.com> wrote:

> Hi,
>
> I don't think the proposal is a good solution to the problems. I am in
> favour of using a ProcessFunction chained to the source/sink function to
> serialize/deserialize the records, instead of embedding (de)serialization
> schema in source/sink function.
>
> Message packing is heavily used in our production environment to allow
> compression and improve throughput. As buffered messages have to be
> delivered when the time exceeds the limit, timers are also required in our
> cases. I think it's also a common need for other users.
>
> In the this proposal, with more components added into the context, in the
> end we will find the serialization/deserialization schema is just another
> wrapper of ProcessFunction.
>
> Regards,
> Xiaogang
>
> Aljoscha Krettek <aljos...@apache.org> 于2020年4月7日周二 下午6:34写道:
>
> > On 07.04.20 08:45, Dawid Wysakowicz wrote:
> >
> > > @Jark I was aware of the implementation of SinkFunction, but it was a
> > > conscious choice to not do it that way.
> > >
> > > Personally I am against giving a default implementation to both the new
> > > and old methods. This results in an interface that by default does
> > > nothing or notifies the user only in the runtime, that he/she has not
> > > implemented a method of the interface, which does not sound like a good
> > > practice to me. Moreover I believe the method without a Collector will
> > > still be the preferred method by many users. Plus it communicates
> > > explicitly what is the minimal functionality required by the interface.
> > > Nevertheless I am happy to hear other opinions.
> >
> > Dawid and I discussed this before. I did the extension of the
> > SinkFunction but by now I think it's better to do it this way, because
> > otherwise you can implement the interface without implementing any
> methods.
> >
> > > @all I also prefer the buffering approach. Let's wait a day or two more
> > > to see if others think differently.
> >
> > I'm also in favour of buffering outside the lock.
> >
> > Also, +1 to this FLIP.
> >
> > Best,
> > Aljoscha
> >
>

Reply via email to