On 29-Dec-2013 23:28, Vladimir Panteleev wrote:
On Sunday, 29 December 2013 at 18:45:36 UTC, Dmitry Olshansky wrote:
I've come to the conclusion that the only sane line-ending behavior is to
do what the Unicode standard says, and detect the following pattern as a
line separator:

\r\n | \r | \f | \v | \n | \u0085 | \u2028 | \u2029

This includes never breaking a line inside a \r\n sequence.

I don't think something as basic as a line-splitting function should do
UTF decoding unless the user asks for it explicitly.

I didn't say decode :)
Just match the pattern as UTF-8 bytes explicitly; the bulk of these separators is side-stepped after a single test instruction plus a conditional branch (which is fairly predictable - almost never taken).
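
Roughly like this (a throwaway sketch, not a Phobos patch - the name lineBreakAt and the exact range test are mine):

// Given raw UTF-8 bytes, return the length of the line separator that
// starts at s[i], or 0 if s[i] does not start one. No decoding is done;
// the multi-byte separators are matched by their UTF-8 byte patterns.
size_t lineBreakAt(const(ubyte)[] s, size_t i)
{
    immutable b = s[i];
    // Single range test + branch: everything in [0x0E, 0xC1], i.e. almost
    // every byte of ordinary text, is rejected right here.
    if (cast(uint)(b - 0x0E) < 0xC2 - 0x0E)
        return 0;
    switch (b)
    {
        case 0x0D: // \r, and never split a \r\n pair
            return (i + 1 < s.length && s[i + 1] == 0x0A) ? 2 : 1;
        case 0x0A, 0x0B, 0x0C: // \n, \v, \f
            return 1;
        case 0xC2: // U+0085 (NEL) is C2 85 in UTF-8
            return (i + 1 < s.length && s[i + 1] == 0x85) ? 2 : 0;
        case 0xE2: // U+2028 / U+2029 are E2 80 A8 / E2 80 A9
            return (i + 2 < s.length && s[i + 1] == 0x80
                    && (s[i + 2] == 0xA8 || s[i + 2] == 0xA9)) ? 3 : 0;
        default:
            return 0;
    }
}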

Getting UTF-8
decoding errors in splitLines when working with ASCII files has caused
me enough frustration to stop using that function altogether (unless I
*KNOW* the text is valid UTF-8). I've yet to encounter a need to split
by anything other than \n and \r\n.

I would argue there is a way to do that almost as cheaply as the trio of \r | \n | \r\n would be. Personal experience notwithstanding, it would be better to do the right thing.
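
Something along these lines (again just a sketch, reusing the made-up lineBreakAt helper from above; whether a trailing empty line is kept is a policy choice):

// Split a raw UTF-8 buffer into lines, recognizing the full separator set.
const(ubyte)[][] splitAnyLines(const(ubyte)[] s)
{
    const(ubyte)[][] lines;
    size_t start = 0, i = 0;
    while (i < s.length)
    {
        immutable len = lineBreakAt(s, i);
        if (len)
        {
            lines ~= s[start .. i];
            i += len;
            start = i;
        }
        else
            ++i;
    }
    lines ~= s[start .. $]; // keeps a final (possibly empty) line
    return lines;
}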

P.S. What I know for sure is that there is a strong need for having better support for other encodings. Raw ASCII included, but encoding assumptions must be explicit.

--
Dmitry Olshansky
