Hello,
Problem: special characters such as öüä are being saved to our sinks as "?".
Setup: we read from Kafka using KafkaIO, run the pipeline with the
Dataflow runner, and save the results to BigQuery and Elasticsearch.
We verified that the data is written to Kafka in UTF-8 (by code
inspection), and also that the special characters appear correctly when
read with kafka-console-consumer.
One more data point: in a local setup, with Kafka in Docker* and using
the Direct runner, the characters were correctly encoded. *The event was
written using kafka-console-producer.
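For what it's worth, a literal "?" (rather than mojibake such as "Ã¶") usually points to some step re-encoding the text with a charset that cannot represent those characters; Java's String.getBytes silently substitutes '?' in that case. A minimal stdlib sketch of the symptom (US-ASCII here is just an example charset; it may be worth comparing Charset.defaultCharset() on the Dataflow workers against the local JVM):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetProbe {
    public static void main(String[] args) {
        String text = "öüä";

        // Encoding with a charset that cannot represent the characters
        // substitutes '?' -- the exact symptom seen in the sinks.
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // ???

        // A proper UTF-8 round trip keeps the characters intact.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // öüä

        // Worth logging on the workers: any getBytes()/new String(byte[])
        // call without an explicit charset falls back to this value.
        System.out.println(Charset.defaultCharset());
    }
}
```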
Reading from Kafka:
pipeline
    .apply(
        "ReadInput",
        KafkaIO.<String, String>read()
            .withBootstrapServers(...)
            .withTopics(...)
            .updateConsumerProperties(...) // only "group.id" and "auto.offset.reset"
            .withValueDeserializer(StringDeserializer.class)
            .withCreateTime(Duration.standardMinutes(10))
            .commitOffsetsInFinalize())
So, any clues on where to investigate? In the meantime I'm going to add
more logging to the application to see if I can detect where the
characters get "lost" in the pipeline, and also try writing to local
Kafka with a Java Kafka producer so I can be sure the data is written in
UTF-8.
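In case it helps with that check: one way to be certain of what ends up on the wire is to hex-dump the UTF-8 bytes on the producer side and compare them with what the consumer receives ("öüä" should be exactly the six bytes c3 b6 c3 bc c3 a4). A small stdlib helper for that byte-level comparison (the producer call itself is omitted here; this only shows the check):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Hex {
    // Hex-dump the UTF-8 encoding of a string, e.g. to compare the
    // producer-side payload with what the consumer side prints.
    public static String toUtf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each of ö, ü, ä is two bytes in UTF-8.
        System.out.println(toUtf8Hex("öüä")); // c3 b6 c3 bc c3 a4
    }
}
```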
Thank you for the support.