The FileHandle.read method accepts a byte size argument, but it is also supposed to work with multi-byte encodings. At the moment, this is solved by returning a string with more bytes than requested if there happens to be a partial multi-byte character at the end of the buffer. This can be surprising and is rather tricky to implement correctly.
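
To illustrate the over-read described above, here is a minimal sketch in Python rather than Parrot's actual implementation. It assumes UTF-8, and the helper name read_at_least is hypothetical. It reads the requested number of bytes and then pulls in whatever extra bytes are needed to complete a trailing partial character.

```python
import io


def read_at_least(stream, size):
    """Read `size` bytes, then extend to the next UTF-8 character boundary.

    Hypothetical helper: it mimics the behaviour described in the text,
    not the real FileHandle.read code.
    """
    data = stream.read(size)

    # Walk back over continuation bytes (10xxxxxx) to the last lead byte.
    i = len(data) - 1
    while i >= 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    if i < 0:
        # Empty read, or only continuation bytes: nothing to complete.
        return data.decode("utf-8", errors="replace")

    # Work out how long the final character should be from its lead byte.
    lead = data[i]
    if lead >> 5 == 0b110:
        need = 2
    elif lead >> 4 == 0b1110:
        need = 3
    elif lead >> 3 == 0b11110:
        need = 4
    else:
        need = 1  # ASCII, or an invalid lead byte treated as one byte

    # Read the missing tail of the final character, if any.
    missing = need - (len(data) - i)
    if missing > 0:
        data += stream.read(missing)
    return data.decode("utf-8", errors="replace")


stream = io.BytesIO("aé€".encode("utf-8"))
first = read_at_least(stream, 2)   # asks for 2 bytes, consumes 3
second = read_at_least(stream, 1)  # asks for 1 byte, consumes 3
```

Both calls return more bytes than were requested, which is the surprising part: the caller cannot rely on the size argument as an upper bound.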

I also don't see many use cases for reading a minimum number of bytes from a handle with a multi-byte encoding. It would be more useful to read a given number of characters. This can be implemented easily on top of my recent Unicode readline improvements.

I tried to simply change the read method to accept character sizes in branch nwellnhof/read_chars but that turned out to break Rakudo. AFAICS Rakudo calls the read method only in one place [1] and immediately converts the result to a ByteBuffer regardless of the current encoding. (This might return a larger buffer than requested if the encoding is set to the default utf8 for the reasons outlined above, which could be considered a bug.)

To support that use case, I propose a new method 'read_bytes' that takes a byte size argument and returns a ByteBuffer. Once this is implemented, Rakudo and possibly other HLLs can switch over, and we can change the 'read' method to accept character counts. Alternatively, we could introduce a new method 'read_chars', but the old 'read' method would then be pretty much useless.
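
As a rough sketch of how the two operations would differ, the following Python models a handle with a byte-oriented read and a character-oriented read. This is only an illustration under assumptions: the Handle class is hypothetical, Python's bytes type stands in for ByteBuffer, and the character-counting method is called read_chars here although the proposal is for 'read' itself to take that role eventually.

```python
import codecs
import io


class Handle:
    """Hypothetical handle used only to contrast the two read styles."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw
        # The incremental decoder buffers partial multi-byte sequences.
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def read_bytes(self, size):
        # Returns at most `size` raw bytes and ignores the encoding.
        return self._raw.read(size)

    def read_chars(self, count):
        # Returns at most `count` decoded characters, consuming as many
        # bytes as the encoding requires.
        out = ""
        while len(out) < count:
            byte = self._raw.read(1)
            if not byte:
                break  # end of input
            # Yields "" until a complete character has been seen.
            out += self._decoder.decode(byte)
        return out


h = Handle(io.BytesIO("aé€b".encode("utf-8")))
two_chars = h.read_chars(2)    # consumes 3 bytes
three_bytes = h.read_bytes(3)  # the raw bytes of the euro sign
rest = h.read_chars(5)         # fewer characters remain than requested
```

With this split, a byte count is never exceeded by read_bytes and a character count is never exceeded by read_chars, so neither call needs the over-read workaround.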

I have no idea how this would affect other HLLs, so comments from HLL developers are especially welcome.

Nick

[1] https://github.com/rakudo/rakudo/blob/master/src/core/IO.pm#L82
_______________________________________________
http://lists.parrot.org/mailman/listinfo/parrot-dev
