FileHandle.read and multi-byte encodings

Nick Wellnhofer Fri, 07 Jan 2011 10:25:37 -0800

The FileHandle.read method accepts a byte size argument, but it is also supposed to work with multi-byte encodings. At the moment, this is solved by returning a string with more bytes than requested if there happens to be a partial multi-byte character at the end of the buffer. This can be surprising and is rather tricky to do correctly.
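
To illustrate the behaviour described above, here is a minimal sketch in Python (not Parrot code). It assumes UTF-8 and a hypothetical helper named read_at_least; neither the name nor the logic is taken from the actual FileHandle implementation. It reads the requested number of bytes and then pulls in extra bytes until the last character is complete.

```python
import io

def utf8_char_len(lead):
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def read_at_least(stream, nbytes):
    """Hypothetical helper: read nbytes, plus whatever completes the last character."""
    buf = stream.read(nbytes)
    # Walk back over continuation bytes (10xxxxxx) to the last lead byte.
    i = len(buf) - 1
    while i >= 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    if i >= 0:
        missing = i + utf8_char_len(buf[i]) - len(buf)
        if missing > 0:
            # The buffer ended mid-character, so read the remaining bytes.
            buf += stream.read(missing)
    return buf.decode("utf-8")

data = "a\u00e9\u20ac".encode("utf-8")       # 1 + 2 + 3 = 6 bytes
result = read_at_least(io.BytesIO(data), 2)  # asks for 2 bytes, gets 3
```

The caller asked for 2 bytes, but the result encodes to 3, which is the surprise mentioned above.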

I also don't see many use cases for reading a minimum number of bytes from a handle with a multi-byte encoding. It would be more useful to read a certain number of characters. This can be implemented easily on top of my recent Unicode readline improvements.

I tried to simply change the read method to accept character sizes in branch nwellnhof/read_chars, but that turned out to break Rakudo. AFAICS Rakudo calls the read method in only one place [1] and immediately converts the result to a ByteBuffer regardless of the current encoding. (This might return a larger buffer than requested if the encoding is set to the default utf8, for the reasons outlined above, which could be considered a bug.)

To support that use case, I propose a new method 'read_bytes' that takes a byte size argument and returns a ByteBuffer. Once this is implemented, Rakudo and possibly other HLLs can switch over, and we can change the 'read' method to accept character counts. Alternatively, we could introduce a new method 'read_chars', but the old 'read' method would be pretty much useless then.
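
The proposed split could look roughly like the following Python sketch. The class SketchHandle is a hypothetical stand-in, not Parrot's FileHandle, and Python's bytes type stands in for a ByteBuffer. It only illustrates the intended semantics: 'read_bytes' counts bytes and ignores the encoding, while 'read' counts characters.

```python
import codecs
import io

class SketchHandle:
    """Hypothetical stand-in for a file handle; not Parrot's FileHandle."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def read_bytes(self, nbytes):
        # Returns raw bytes, never more than requested, whatever the encoding.
        return self._raw.read(nbytes)

    def read(self, nchars):
        # Decodes byte by byte so that nothing past the last character is consumed.
        # A partial character at end of input is silently dropped in this sketch.
        pieces = []
        count = 0
        while count < nchars:
            byte = self._raw.read(1)
            if not byte:
                break
            text = self._decoder.decode(byte)
            pieces.append(text)
            count += len(text)
        return "".join(pieces)

handle = SketchHandle(io.BytesIO("a\u00e9\u20acz".encode("utf-8")))
two_chars = handle.read(2)          # two characters, three bytes consumed
three_bytes = handle.read_bytes(3)  # the raw bytes of the euro sign
rest = handle.read(5)               # fewer characters left than requested
```

Reading byte by byte is slow and is done here only to keep the sketch short. A real implementation would decode from an internal buffer.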

I have no idea how this would affect other HLLs, so comments from HLL developers are especially welcome.


Nick

[1] https://github.com/rakudo/rakudo/blob/master/src/core/IO.pm#L82
