Hi,
We have a Flink job that reads data from an input stream, then converts each
event from JSON string Avro object, finally writes to parquet files using
StreamingFileSink with OnCheckPointRollingPolicy of 5 mins. Basically a
stateless job. Initially, we use one map operator to convert Json string to
Avro object, Inside the map function, it goes form String -> JsonObject -> Avro
object.
DataStream<AvroSchema> avroData = data.map(new JsonToAVRO());
When we try to break the map operator to two, one for String to JsonObject,
another for JsonObject to Avro.
DataStream<JsonObject> JsonData = data.map(new StringToJson());
DataStream<AvroSchema> avroData = rawDataAsJson.map(new
JsonToAvroSchema())
The benchmark shows significant performance hit when breaking down to two
operators. We try to understand the Flink internal on why such a big
difference. The setup is using state backend = filesystem. Checkpoint = s3
bucket. Our event object has 300+ attributes.
Thanks
Ivan