Re: Ability to read from UTF-16 or UTF-32 encoded files?

2018-07-09 Thread Etienne Chauchot
Hi,
Just a small clarification: TextIO actually already supports custom multi-byte
delimiters in place of newlines. See
TextIO#withDelimiter(byte[] delimiter)
Etienne
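
For illustration, a minimal sketch of the hook Etienne mentions (the file path
and pipeline setup here are hypothetical; note that TextIO still decodes each
record's bytes as UTF-8, so a custom delimiter alone does not make UTF-16
content decode correctly):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DelimiterSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Split records on CRLF rather than the default newline scanning.
    PCollection<String> lines =
        p.apply(TextIO.read()
            .from("gs://my-bucket/input.txt")         // hypothetical path
            .withDelimiter(new byte[] {'\r', '\n'})); // any byte sequence works

    // A UTF-16LE newline would be new byte[] {0x0A, 0x00}, but the record
    // bytes between delimiters would still be decoded as UTF-8.

    p.run().waitUntilFinish();
  }
}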

Re: Ability to read from UTF-16 or UTF-32 encoded files?

2018-07-07 Thread Robert Bradshaw
Currently TextIO scans for newlines to find line (record) boundaries, but in
UTF-16 or UTF-32 the newline byte can also occur inside a multi-byte
character. It could certainly be adapted to look for multi-byte patterns
(with the right offset), but this would be more complicated.
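
Concretely (an editor's illustration, not code from the thread): an ordinary
Gurmukhi letter encodes in UTF-16BE to two bytes that each equal '\n', while a
real newline in UTF-16LE spans two bytes, so matching the single byte 0x0A
both false-positives and splits mid-character:

import java.nio.charset.StandardCharsets;

public class NewlineByteCollision {
  public static void main(String[] args) {
    // U+0A0A (GURMUKHI LETTER UU) is not a line break, but in UTF-16BE it
    // encodes as the two bytes 0x0A 0x0A -- each identical to '\n'.
    byte[] big = "\u0A0A".getBytes(StandardCharsets.UTF_16BE);
    for (byte b : big) {
      System.out.printf("0x%02X ", b); // prints: 0x0A 0x0A
    }
    System.out.println();

    // Even a real newline is two bytes in UTF-16LE (0x0A 0x00), so a scanner
    // matching the single byte 0x0A would split in the middle of the character.
    byte[] little = "\n".getBytes(StandardCharsets.UTF_16LE);
    System.out.printf("0x%02X 0x%02X%n", little[0], little[1]); // 0x0A 0x00
  }
}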

Fortunately, the default of UTF-8 handles non-western languages very well,
but an option to support other encodings would be welcome.
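
Until such an option exists, one possible workaround (an editor's sketch under
current APIs, not something proposed in the thread; the file pattern is
hypothetical) is to bypass TextIO: match the files with FileIO, read each one
whole, and decode it with an explicit charset. This gives up TextIO's
splitting of large files across workers:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class Utf16ReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> lines =
        p.apply(FileIO.match().filepattern("/path/to/corpus/*.txt")) // hypothetical
         .apply(FileIO.readMatches())
         .apply(FlatMapElements
             .into(TypeDescriptors.strings())
             .via((FileIO.ReadableFile f) -> {
               try {
                 // StandardCharsets.UTF_16 consumes a leading BOM and picks
                 // the matching byte order (defaulting to big-endian).
                 String text = new String(f.readFullyAsBytes(), StandardCharsets.UTF_16);
                 return Arrays.asList(text.split("\r?\n"));
               } catch (IOException e) {
                 throw new RuntimeException(e);
               }
             }));

    p.run().waitUntilFinish();
  }
}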

Ability to read from UTF-16 or UTF-32 encoded files?

2018-07-07 Thread Harry Braviner
Is there any reason that TextIO couldn't be expanded to read from UTF-16 or
UTF-32 encoded files?

Certainly Python and Java strings support UTF-16, so there shouldn't be
additional complication there.

This would make Beam able to process non-Latin character sets more easily.
It would also alleviate a bug I ran into while doing the MinimalWordCount
tutorial: the first dataset you find if you Google "shakespeare corpus"
(http://lexically.net/wordsmith/support/shakespeare.html) is UTF-16 encoded.
The first byte of each UTF-16 character gets interpreted as a non-letter
UTF-8 character, so the pipeline produces a letter count instead of a word
count. However, I think being able to handle non-western languages would be
the far greater benefit from this.
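
What that failure looks like concretely (an editor's illustration, not code
from the thread, approximating the tutorial's split on non-letter characters):
decoding UTF-16BE bytes as a one-byte-per-character charset interleaves NUL
characters, and every NUL ends a "word":

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MisdecodeDemo {
  public static void main(String[] args) {
    // UTF-16BE puts a 0x00 byte before each ASCII letter.
    byte[] utf16 = "To be".getBytes(StandardCharsets.UTF_16BE);

    // Misreading those bytes as a one-byte-per-char encoding interleaves
    // NUL characters between the letters.
    String misread = new String(utf16, StandardCharsets.ISO_8859_1);

    // NUL is not a letter, so a split on non-letters (as in the word-count
    // tutorial) yields one token per letter, plus a leading empty token.
    System.out.println(Arrays.toString(misread.split("[^\\p{L}]+")));
    // prints: [, T, o, b, e]
  }
}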