On 29-Dec-2013 23:28, Vladimir Panteleev wrote:
On Sunday, 29 December 2013 at 18:45:36 UTC, Dmitry Olshansky wrote:
I've come to the conclusion that the only sane line-ending behavior is to
do what the Unicode standard says, and detect the following pattern as a
line separator:

\r\n | \r | \f | \v | \n | \u0085 | \u2028 | \u2029

This includes never breaking a line inside a \r\n sequence.

I don't think something as basic as a line-splitting function should do
UTF decoding unless the user asks for it explicitly.

I didn't say decode :)
Just match the pattern as UTF-8 bytes explicitly; the bulk of these separators is side-stepped after a single test instruction plus a conditional branch (which is fairly predictable - almost never taken).
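
Roughly like this (a throwaway sketch, not a Phobos patch - the name lineBreakAt and the exact range test are mine):

// Given raw UTF-8 bytes, return the length of the line separator that
// starts at s[i], or 0 if s[i] does not start one. No decoding is done;
// the multi-byte separators are matched by their UTF-8 byte patterns.
size_t lineBreakAt(const(ubyte)[] s, size_t i)
{
    immutable b = s[i];
    // Single range test + branch: everything in [0x0E, 0xC1], i.e. almost
    // every byte of ordinary text, is rejected right here.
    if (cast(uint)(b - 0x0E) < 0xC2 - 0x0E)
        return 0;
    switch (b)
    {
        case 0x0D: // \r, and never split a \r\n pair
            return (i + 1 < s.length && s[i + 1] == 0x0A) ? 2 : 1;
        case 0x0A, 0x0B, 0x0C: // \n, \v, \f
            return 1;
        case 0xC2: // U+0085 (NEL) is C2 85 in UTF-8
            return (i + 1 < s.length && s[i + 1] == 0x85) ? 2 : 0;
        case 0xE2: // U+2028 / U+2029 are E2 80 A8 / E2 80 A9
            return (i + 2 < s.length && s[i + 1] == 0x80
                    && (s[i + 2] == 0xA8 || s[i + 2] == 0xA9)) ? 3 : 0;
        default:
            return 0;
    }
}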

Getting UTF-8
decoding errors in splitLines when working with ASCII files has caused
me enough frustration to stop using that function altogether (unless I
*KNOW* the text is valid UTF-8). I've yet to encounter a need to split
by anything other than \n and \r\n.

I would argue there is a way to do that almost as cheaply as the trio of \r | \n | \r\n would be. Personal experience notwithstanding, it would be better to do the right thing.
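
Something along these lines (again just a sketch, reusing the made-up lineBreakAt helper from above; whether a trailing empty line is kept is a policy choice):

// Split a raw UTF-8 buffer into lines, recognizing the full separator set.
const(ubyte)[][] splitAnyLines(const(ubyte)[] s)
{
    const(ubyte)[][] lines;
    size_t start = 0, i = 0;
    while (i < s.length)
    {
        immutable len = lineBreakAt(s, i);
        if (len)
        {
            lines ~= s[start .. i];
            i += len;
            start = i;
        }
        else
            ++i;
    }
    lines ~= s[start .. $]; // keeps a final (possibly empty) line
    return lines;
}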

P.S. What I know for sure is that there is a strong need for having better support for other encodings. Raw ASCII included, but encoding assumptions must be explicit.

--
Dmitry Olshansky
