Is there any reason that TextIO couldn't be expanded to read from UTF-16 or
UTF-32 encoded files?

Certainly Python and Java strings support UTF-16, so the decoding itself
shouldn't add any complication there.
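
For what it's worth, on the Java side the decoding is already built into
the standard library. A minimal, self-contained illustration (the class
name is just for the example):

import java.nio.charset.StandardCharsets;

public class Utf16DecodeDemo {
    public static void main(String[] args) {
        // "Word" encoded as UTF-16 with a big-endian byte-order mark.
        byte[] utf16Bytes = {
            (byte) 0xFE, (byte) 0xFF,  // BOM
            0x00, 0x57, 0x00, 0x6F,    // W, o
            0x00, 0x72, 0x00, 0x64     // r, d
        };

        // The JDK consumes the BOM and picks the byte order itself.
        System.out.println(new String(utf16Bytes, StandardCharsets.UTF_16));
        // prints: Word
    }
}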

This would let Beam process non-Latin character sets more easily. It would
also alleviate a bug I ran into while doing the MinimalWordCount tutorial:
the first dataset you find if you Google "shakespeare corpus" (
http://lexically.net/wordsmith/support/shakespeare.html) is UTF-16 encoded.
When TextIO decodes those bytes as UTF-8, the zero byte in each UTF-16 code
unit comes through as a non-letter NUL character, so the word-splitting
regex treats every letter as its own word and the pipeline produces a
letter count instead of a word count. Still, I think being able to handle
non-Western languages would be the far greater benefit. (A possible interim
workaround is sketched below.)
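
In the meantime, one workaround I can imagine (only a sketch, assuming a
Beam release that includes FileIO; readUtf16Lines and the whole-file
approach are my own invention, not an official API) would be to match the
files and decode them in a DoFn:

import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class Utf16ReadSketch {
  /** Reads whole UTF-16 files and emits their lines one by one. */
  public static PCollection<String> readUtf16Lines(Pipeline p, String pattern) {
    return p
        .apply(FileIO.match().filepattern(pattern))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            // Decode the whole file as UTF-16; the JDK uses the BOM, if
            // present, to pick the byte order. Reading files whole is fine
            // for a small corpus but gives up TextIO's splitting of large
            // files into parallel bundles.
            String contents =
                new String(c.element().readFullyAsBytes(), StandardCharsets.UTF_16);
            for (String line : contents.split("\r?\n")) {
              c.output(line);
            }
          }
        }));
  }
}

A charset option on TextIO itself would make this plumbing unnecessary,
which is really the point of the request.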
