Hi everyone,
I think this is a fairly standard problem with a quick answer, but I
haven't been able to find it online.
We have Avro messages in a Kafka topic, written with an HWX (Hortonworks
Schema Registry) schema reference. We can read them with e.g.
ConsumeKafkaRecord with an AvroReader.

Now we would like to merge the smaller flowfiles into larger files before
loading them into HDFS. Which combination of processors gives us the
highest performance?
Option 1: ConsumeKafkaRecord with an AvroReader and AvroRecordSetWriter,
then MergeRecord with AvroReader/AvroRecordSetWriter. It works and seems
straightforward, but to me it involves too many interpretations and
rewrites of the records: each record interpretation is an unnecessary
round of deserialization and re-serialization through the Java heap.
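For reference, the Option 1 flow we tried looks roughly like this (property
names as they appear in the NiFi UI; the concrete values, such as the bin
size, are just an illustration of our setup):

```
ConsumeKafkaRecord
  Record Reader  = AvroReader         (Schema Access Strategy:
                                       HWX Content-Encoded Schema Reference)
  Record Writer  = AvroRecordSetWriter (Schema Write Strategy:
                                        Embed Avro Schema)

MergeRecord
  Record Reader  = AvroReader         (Schema Access Strategy:
                                       Use Embedded Avro Schema)
  Record Writer  = AvroRecordSetWriter
  Merge Strategy = Bin-Packing Algorithm
  Minimum Number of Records = 10000   # example bin size, not a recommendation
```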

Option 2: somehow configure ConsumeKafka and MergeContent to do this? We
used this combination for simple JSON messages (with binary concatenation),
but we can't get it right for Avro messages with a schema reference (the
PutParquet processor can't read the merged files with an AvroReader). On
the other hand, this should be the fastest approach, since there is no data
interpretation, just a byte-for-byte copy. Maybe we just haven't tried the
right combination of settings?
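To illustrate why plain concatenation trips up the AvroReader: each Kafka
message is a bare Avro datum prefixed with a schema-registry header, not an
Avro Object Container File, so the merged bytes never start with the OCF
magic ("Obj" + 0x01) that a file-oriented reader expects. A minimal sketch
(the header layout below is a rough assumption for illustration, not the
exact HWX wire format):

```python
# An Avro data file (Object Container File) starts with the magic "Obj\x01"
# and embeds the writer schema; an Avro file reader expects this layout.
AVRO_OCF_MAGIC = b"Obj\x01"

# A schema-registry-encoded Kafka message is different: a small header
# referencing the schema, followed by one bare datum. The layout here is
# only a rough illustration, not the real HWX serializer format.
def fake_registry_message(schema_id: int, datum: bytes) -> bytes:
    return bytes([1]) + schema_id.to_bytes(8, "big") + datum

# What binary-concatenation merging produces from three such messages:
merged = b"".join(fake_registry_message(42, b"<datum>") for _ in range(3))

# Byte-for-byte concatenation therefore never yields a valid Avro file:
assert not merged.startswith(AVRO_OCF_MAGIC)
```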

Or are there other options we have missed?

Thanks in advance for any advice.
Krzysztof