Re: Ability to read from UTF-16 or UTF-32 encoded files?
Hi,

Just a small clarification: TextIO actually already supports a custom multi-byte delimiter in place of newlines. See TextIO#withDelimiter(byte[] delimiter); a usage sketch follows below the quoted thread.

Etienne

On Saturday, July 7, 2018, at 16:15 -0700, Robert Bradshaw wrote:
> Currently TextIO scans for newlines to find line (record) boundaries, but
> these bytes can occur as part of a character in UTF-16 or UTF-32. It could
> certainly be adapted to look for multi-byte patterns (with the right
> offset), but this would be more complicated.
>
> Fortunately, the default of UTF-8 handles non-western languages very well,
> but an option to support other encodings would be welcome.
>
> On Sat, Jul 7, 2018 at 1:33 PM Harry Braviner wrote:
> > Is there any reason that TextIO couldn't be expanded to read from UTF-16
> > or UTF-32 encoded files?
> >
> > Certainly Python and Java strings support UTF-16, so there shouldn't be
> > additional complication there.
> >
> > This would make Beam able to process non-Latin character sets more
> > easily. It would also alleviate a bug I ran into while doing the
> > MinimalWordCount tutorial: the first dataset you find if you Google
> > "shakespeare corpus" (http://lexically.net/wordsmith/support/shakespeare.html)
> > is UTF-16 encoded. The first byte of each UTF-16 character gets
> > interpreted as a non-letter UTF-8 character, and the pipeline gives a
> > letter count instead of a word count. However, I think being able to
> > handle non-western languages would be the far greater benefit from this.
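A minimal sketch of the withDelimiter approach (the file path is hypothetical; note that TextIO still decodes each record's bytes as UTF-8, so this addresses record splitting, not decoding):

import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class Utf16DelimiterRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // "\n" encoded as UTF-16LE is the two bytes 0x0A 0x00.
    byte[] utf16leNewline = "\n".getBytes(StandardCharsets.UTF_16LE);

    // Split records on the two-byte UTF-16LE newline rather than a lone 0x0A.
    PCollection<String> lines =
        p.apply(TextIO.read()
            .from("gs://my-bucket/shakespeare-utf16le.txt")  // hypothetical path
            .withDelimiter(utf16leNewline));

    p.run().waitUntilFinish();
  }
}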
Re: Ability to read from UTF-16 or UTF-32 encoded files?
Currently TextIO scans for newlines to find line (record) boundaries, but these bytes can occur as part of a character in UTF-16 or UTF-32. It could certainly be adapted to look for multi-byte patterns (with the right offset), but this would be more complicated.

Fortunately, the default of UTF-8 handles non-western languages very well, but an option to support other encodings would be welcome.

On Sat, Jul 7, 2018 at 1:33 PM Harry Braviner wrote:
> Is there any reason that TextIO couldn't be expanded to read from UTF-16
> or UTF-32 encoded files?
>
> Certainly Python and Java strings support UTF-16, so there shouldn't be
> additional complication there.
>
> This would make Beam able to process non-Latin character sets more easily.
> It would also alleviate a bug I ran into while doing the MinimalWordCount
> tutorial: the first dataset you find if you Google "shakespeare corpus"
> (http://lexically.net/wordsmith/support/shakespeare.html) is UTF-16
> encoded. The first byte of each UTF-16 character gets interpreted as a
> non-letter UTF-8 character, and the pipeline gives a letter count instead
> of a word count. However, I think being able to handle non-western
> languages would be the far greater benefit from this.
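To make the failure mode concrete, here is a small standalone illustration (plain Java, not Beam) of a newline byte occurring inside a UTF-16 code unit:

import java.nio.charset.StandardCharsets;

public class NewlineByteCollision {
  public static void main(String[] args) {
    // U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) encodes in UTF-16BE
    // as the bytes 0x01 0x0A. The second byte equals '\n' (0x0A), so a
    // byte-level newline scan would split this character down the middle.
    byte[] bytes = "\u010A".getBytes(StandardCharsets.UTF_16BE);
    for (byte b : bytes) {
      System.out.printf("0x%02X ", b);  // prints: 0x01 0x0A
    }
    System.out.println();
  }
}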
Ability to read from UTF-16 or UTF-32 encoded files?
Is there any reason that TextIO couldn't be expanded to read from UTF-16 or UTF-32 encoded files?

Certainly Python and Java strings support UTF-16, so there shouldn't be additional complication there.

This would make Beam able to process non-Latin character sets more easily. It would also alleviate a bug I ran into while doing the MinimalWordCount tutorial: the first dataset you find if you Google "shakespeare corpus" (http://lexically.net/wordsmith/support/shakespeare.html) is UTF-16 encoded. The first byte of each UTF-16 character gets interpreted as a non-letter UTF-8 character, and the pipeline gives a letter count instead of a word count. However, I think being able to handle non-western languages would be the far greater benefit from this.
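Until TextIO grows encoding support, one possible workaround is to bypass its byte-level line splitting altogether: read each file whole with FileIO and decode it as UTF-16 explicitly. A sketch, assuming the filepattern below (hypothetical) and accepting that whole-file reads give up within-file parallelism:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class Utf16WholeFileRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> contents =
        p.apply(FileIO.match().filepattern("/path/to/shakespeare/*.txt"))  // hypothetical
         .apply(FileIO.readMatches())
         .apply(MapElements.into(TypeDescriptors.strings())
             .via((FileIO.ReadableFile file) -> {
               try {
                 // StandardCharsets.UTF_16 respects a leading BOM and
                 // defaults to big-endian when none is present.
                 return new String(file.readFullyAsBytes(), StandardCharsets.UTF_16);
               } catch (IOException e) {
                 throw new RuntimeException(e);
               }
             }));

    p.run().waitUntilFinish();
  }
}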