Hey Theo,

we recently published a blog post that answers your request for a comparison between Kryo and Avro in Flink: https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html
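If you want to double-check which serializer Flink picks for one of your own classes, a minimal sketch like the one below is enough. "MyEvent" is just a made-up stand-in for one of your POJOs; the TypeInformation lookup is the same one Flink performs internally:

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.java.typeutils.GenericTypeInfo;
import org.apache.flink.api.java.typeutils.PojoTypeInfo;

public class SerializerCheck {

    // Made-up example: public fields (or getters/setters) plus a
    // no-arg constructor are what qualify a class as a Flink POJO.
    public static class MyEvent {
        public String id;
        public long timestamp;
    }

    public static void main(String[] args) {
        TypeInformation<MyEvent> typeInfo = TypeInformation.of(MyEvent.class);

        if (typeInfo instanceof PojoTypeInfo) {
            System.out.println("Flink will use its fast PojoSerializer");
        } else if (typeInfo instanceof GenericTypeInfo) {
            System.out.println("Flink will fall back to Kryo");
        }

        TypeSerializer<MyEvent> serializer =
                typeInfo.createSerializer(new ExecutionConfig());
        System.out.println("Serializer: " + serializer.getClass().getSimpleName());
    }
}

If the Kryo branch prints, one of the POJO rules is violated and you silently pay the fallback cost the blog post describes.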
On Tue, Mar 10, 2020 at 9:27 AM Arvid Heise <[email protected]> wrote:

> Hi Theo,
>
> I strongly discourage the use of Flink serialization for persistent
> storage of data. It was never intended to work that way and does not
> offer Avro's benefits of lazy schema evolution and maturity.
>
> Unless you can explicitly measure that Avro is a bottleneck in your
> setup, stick with it. It's the preferred way to store data in Kafka for
> a reason: it's mature, supports plenty of languages, and the schema
> evolution feature will save you many headaches in the future.
>
> If it turns out to be a bottleneck, the most logical alternative is
> Protobuf. Kryo is an even worse fit than the Flink serializers for
> Kafka. Realistically speaking, it's much more cost-effective to just add
> another node to your Flink cluster and use Avro than to come up with any
> clever solution (just assume that you need at least one man-month to
> implement it and do the math).
>
> And by the way, you should always use generated Java/Scala classes for
> Avro if possible. It's faster and offers a much nicer development
> experience.
>
> On Mon, Mar 9, 2020 at 3:57 PM Robert Metzger <[email protected]> wrote:
>
>> Hi Theo,
>>
>>> However, in most benchmarks, Avro turns out to be rather slow in terms
>>> of CPU cycles (e.g. [1] <https://github.com/eishay/jvm-serializers/wiki>)
>>
>> Avro is slower compared to what?
>> You should not only benchmark the CPU cycles for serializing the data.
>> If you are sending JSON strings across the network, you'll probably
>> have a lot more bytes to send, making everything slower (usually the
>> network is slower than the CPU).
>>
>> One of the reasons why people use Avro is that it supports schema
>> evolution.
>>
>> Regarding your questions:
>> 1. For this use case, you can use the Flink data format as an internal
>> message format (between the star-architecture jobs).
>> 2. Generally speaking, no.
>> 3. You will at least have a dependency on flink-core. And this is a
>> somewhat custom setup, so you might be facing breaking API changes.
>> 4. I'm not aware of any benchmarks. The Flink serializers are mostly
>> for internal use (between our operators), Kryo is our fallback (to not
>> suffer too much from the not-invented-here syndrome), while Avro is
>> meant for cross-system serialization.
>>
>> I have the feeling that you can move ahead with using Flink's POJO
>> serializer everywhere :)
>>
>> Best,
>> Robert
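To make Robert's point 3 above concrete: a plain-Java Kafka consumer reading Flink-serialized bytes would need roughly the following, and every import comes from Flink. A rough sketch only, not a recommendation; MyEvent is again the made-up POJO from the sketch further up:

import java.io.IOException;

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.core.memory.DataInputDeserializer;
import org.apache.flink.core.memory.DataOutputSerializer;

public class PojoRoundTrip {

    // Same shape as the MyEvent example above.
    public static class MyEvent {
        public String id;
        public long timestamp;
    }

    public static void main(String[] args) throws IOException {
        TypeSerializer<MyEvent> serializer =
                TypeInformation.of(MyEvent.class)
                        .createSerializer(new ExecutionConfig());

        MyEvent event = new MyEvent();
        event.id = "abc";
        event.timestamp = 42L;

        // Producer side: serialize to bytes, e.g. the Kafka record value.
        DataOutputSerializer out = new DataOutputSerializer(64);
        serializer.serialize(event, out);
        byte[] bytes = out.getCopyOfBuffer();

        // Consumer side: this is the part that drags flink-core (and the
        // risk of breaking API changes) into every downstream consumer.
        MyEvent copy = serializer.deserialize(new DataInputDeserializer(bytes));
        System.out.println(copy.id + " @ " + copy.timestamp);
    }
}

Note that no schema travels with the bytes, so every producer and consumer must agree on the exact class version.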
>> On Wed, Mar 4, 2020 at 1:04 PM Theo Diefenthal <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> Without knowing too much about Flink serialization, I know that Flink
>>> states that it serializes POJO types much faster than even the fast
>>> Kryo for Java. I further know that it supports schema evolution in the
>>> same way as Avro.
>>>
>>> In our project, we have a star architecture, where one Flink job
>>> produces results into a Kafka topic and where we have multiple
>>> downstream consumers of that topic (mostly other Flink jobs).
>>> For fast development cycles, we currently use JSON as the output
>>> format for the Kafka topic due to its easy debugging capabilities and
>>> good migration properties. However, when scaling up, we need to switch
>>> to a more efficient format. Most often, Avro is mentioned in
>>> combination with a schema registry, as it's much more efficient than
>>> JSON, where essentially each message contains the schema as well.
>>> However, in most benchmarks, Avro turns out to be rather slow in terms
>>> of CPU cycles (e.g. [1] <https://github.com/eishay/jvm-serializers/wiki>).
>>>
>>> My question(s) now:
>>> 1. Is it reasonable to use Flink serializers as the message format in
>>> Kafka?
>>> 2. Are there any downsides to using Flink's serialization result as
>>> the output format to Kafka?
>>> 3. Can downstream consumers, written in Java but not Flink components,
>>> also easily deserialize Flink-serialized POJOs? Or do they have a
>>> dependency on at least the full flink-core?
>>> 4. Do you have benchmarks comparing Flink (de-)serialization
>>> performance to e.g. Kryo and Avro?
>>>
>>> The only thing I can come up with for why I wouldn't use Flink
>>> serialization is that we wouldn't have a schema registry. But in our
>>> case, we share all our POJOs in a jar used by all components, so that
>>> is kind of a schema registry already, and if we only make
>>> Avro-compatible changes, which are also well supported by Flink, there
>>> shouldn't be any limitation compared to Avro plus a registry?
>>>
>>> Best regards
>>> Theo
>>>
>>> [1] https://github.com/eishay/jvm-serializers/wiki
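And since generated Avro classes came up: wiring such a class into a Flink Kafka source is a one-liner with the flink-avro format. A sketch under assumptions: "MyEventRecord" stands for a hypothetical class generated by the Avro compiler from your .avsc schema, and the topic name and bootstrap servers are placeholders.

import java.util.Properties;

import org.apache.flink.formats.avro.AvroDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AvroFromKafka {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder

        // MyEventRecord: hypothetical Avro-generated class (extends
        // SpecificRecordBase); forSpecific uses its embedded schema.
        DataStream<MyEventRecord> events = env.addSource(
                new FlinkKafkaConsumer<>(
                        "events", // placeholder topic
                        AvroDeserializationSchema.forSpecific(MyEventRecord.class),
                        props));

        events.print();
        env.execute("avro-from-kafka");
    }
}

There is a matching AvroSerializationSchema.forSpecific(...) for the producing job, and flink-avro also ships a registry-aware ConfluentRegistryAvroDeserializationSchema if you later do add a schema registry.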
