I just spent the past two days debugging a character corruption issue in a
Dataflow pipeline. It turned out that we had encoded a json object to a
string and then called getBytes() without specifying a charset. In our
testing infrastructure, this didn't cause a problem because the default
charset on the system was UTF-8. Whatever the default charset is on
Dataflow workers, it is apparently not UTF-8.

The main lesson here is to be very careful about always specifying a
charset when encoding and decoding strings. But, it would be nice to
protect ourselves from this problem in the future.

Is there any way for users to specify environment variables and/or Java
system properties when deploying a pipeline to Dataflow such that those
settings are in effect on all workers? I'd like to ensure UTF-8 is the
default charset throughout the pipeline on any system.

Reply via email to