Clearly an indentation-sensitive syntax has to reliably detect end-of-line. I think it's important, for users, that the reader reliably work even when you have line endings that aren't the standard for your platform. Especially if standards mandate their support!
Below is the current status and my plan to make detecting line endings "just work", even in odd cases. Comments are welcome.

I will *NOT* make any code changes to the reader right *now*. Alan Manuel Gloria is reorganizing all the reader code, and while git is good, it can't work miracles. But once he finishes reorganizing, I can easily implement DA PLAN below.

--- David A. Wheeler

=== DA PLAN ===

The code already handles line endings of LF (\n) and CRLF (\r\n), the Unix and MS-DOS/Windows conventions respectively. For most people, that's enough.

But R6RS is more complicated than that. R6RS section 4.2.1 <http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.1> defines line ending as:

  <line ending> ::= <linefeed> | <carriage return>
      | <carriage return> <linefeed> | <next line>
      | <carriage return> <next line> | <line separator>

R6RS section 4.1 <http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.1> includes this definition, which defines the characters:

  Some non-terminal names refer to the Unicode scalar values of the
  same name: <character tabulation> (U+0009), <linefeed> (U+000A),
  <carriage return> (U+000D), <line tabulation> (U+000B),
  <form feed> (U+000C), <carriage return> (U+000D), <space> (U+0020),
  <next line> (U+0085), <line separator> (U+2028), and
  <paragraph separator> (U+2029).

Misleadingly, R6RS section 4.2.2, titled "Line endings" <http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.2>, doesn't mention all these options. It mentions CRLF and LFCR, but not the IBM "next line" character. But I think the productions in 4.2.1 are intended to govern.

So, here's what I plan to do.
First, I plan to define the following as the end-of-line characters:

  <linefeed>, aka \n, U+000A
  <carriage return>, aka \r, U+000D
  <next line>, U+0085
  <line separator>, U+2028

I plan for the readers to follow these rules:

* Lines end if ANY end-of-line character appears.
* To consume the eol, consume the first end-of-line character. If the next character is a DIFFERENT end-of-line character, consume that too.

This way, \n\n is recognized as 2 lines, while \r\n and \n\r are each recognized as one line. A \n without an end-of-line character after it ends the line, as expected. Weirdness like <carriage return> <next line> is recognized too. A few pairs that aren't required by R6RS would be considered a line ending as well (e.g., <linefeed><next line>), but I think it's better to use this simple rule. It's sensible, and it makes the reader more robust when dealing with odd text files.

One complication: this would recognize U+0085 as <next line>. If the input contains data that read-char interprets as character U+0085, and it was not intended to be a next line, well, it'll be a next line now. But this doesn't appear to be likely. Users who use UTF-8 everywhere will have no issues, of course. Many other encodings, such as Latin-1, will have no problem as well; 8X (bytes 0x80 through 0x8F) is in the control character space for Latin-1 and, I believe, for all European encodings. And so on.

I think this is unlikely to be an issue, and the advantage is that even unusual line-ending encodings will "just work".
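To make the consumption rule concrete, here is a minimal sketch of it. This is only an illustration, not the reader code: the real readers are written in Scheme and work a character at a time via read-char, whereas this sketch is in Python and works on a whole string. The names EOL_CHARS and split_lines are made up for this example.

```python
# The four end-of-line characters named in the plan:
# <linefeed>, <carriage return>, <next line>, <line separator>.
EOL_CHARS = frozenset("\n\r\u0085\u2028")


def split_lines(text):
    """Split text into lines using the proposed rule (hypothetical helper).

    Any end-of-line character ends the line. If the character right after
    it is a DIFFERENT end-of-line character, that one is consumed too, so
    the pair counts as a single line ending.
    """
    lines = []
    current = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        i += 1
        if ch in EOL_CHARS:
            # Consume a second end-of-line character only if it differs
            # from the first; \n\n therefore stays two line endings.
            if i < n and text[i] in EOL_CHARS and text[i] != ch:
                i += 1
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        # Final line that has no terminator.
        lines.append("".join(current))
    return lines
```

With this sketch, split_lines("a\n\nb") gives three lines (the middle one empty), while "a\r\nb", "a\n\rb", and "a\r\u0085b" each give two. One thing the sketch makes visible: an LF followed by a CRLF ("\n\r\n") is paired up as LFCR and then LF, which differs from the intended grouping but still counts as two line endings.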
_______________________________________________
Readable-discuss mailing list
Readable-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/readable-discuss