Clearly, an indentation-sensitive syntax has to reliably detect end-of-line.  I 
think it's important for users that the reader work reliably even when the 
input has line endings that aren't the standard for the platform, especially 
where standards mandate their support!

Below is the current status and my current plan to make detecting line endings 
"just work", even in odd cases.  Comments are welcome.

I will *NOT* make any code changes to the reader right *now*. Alan Manuel 
Gloria is reorganizing all the reader code, and while git is good, it can't 
work miracles.  But once he finishes reorganizing, I can easily implement DA 
PLAN below.

--- David A. Wheeler



=== DA PLAN ===

The code already handles line-endings of LF (\n) and CRLF (\r\n), the Unix and 
MS-DOS/Windows conventions respectively.  For most people, that's enough.

But R6RS is more complicated than that. R6RS section 4.2.1 
<http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.1> defines 
line ending as:
<line ending> ::= <linefeed> | <carriage return>
         | <carriage return> <linefeed> | <next line>
         | <carriage return> <next line> | <line separator>

R6RS section 4.1 
<http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.1> includes 
this definition of the characters those names refer to:
Some non-terminal names refer to the Unicode scalar values of the same name: 
<character tabulation> (U+0009), <linefeed> (U+000A), <carriage return> 
(U+000D), <line tabulation> (U+000B), <form feed> (U+000C), <carriage return> 
(U+000D), <space> (U+0020), <next line> (U+0085), <line separator> (U+2028), 
and <paragraph separator> (U+2029).

Misleadingly, the R6RS section 4.2.2 titled "Line endings" 
<http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.2> doesn't 
mention all these options. It mentions CRLF and LFCR, but not the IBM "next 
line" character.  But I think the productions in 4.2.1 are intended to govern.

So, here's what I plan to do.  First, I plan to define the following as the 
end-of-line characters:
  <linefeed>, aka \n, U+000A
  <carriage return>, aka \r, U+000D
  <next line>, U+0085
  <line separator>, U+2028

I plan for the readers to follow the following rules:
* Lines end if ANY end-of-line character appears
* To consume the eol, consume the first end-of-line character.  If the next 
character is a DIFFERENT end-of-line character, consume that too.

This way, \n\n is recognized as 2 lines, while \r\n and \n\r are recognized as 
one line. A \n without an end-of-line character after it ends the line, as 
expected.  Weirdness like <carriage return> <next line> is recognized too.  A 
few pairs that aren't required by R6RS would be considered a line-ending as 
well (e.g., <linefeed><next line>), but I think it's better to use this simple 
rule.  It's sensible, and makes it more robust when dealing with odd text files.
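
To make the rule concrete, here is a minimal sketch in Python (not the 
reader's actual code; the names EOL_CHARS, consume_eol and split_lines are 
made up for illustration).  It assumes the input has already been decoded 
into characters:

```python
# Hypothetical sketch of the proposed rule; not the reader's real code.
# The four end-of-line characters listed above.
EOL_CHARS = {"\n", "\r", "\u0085", "\u2028"}


def consume_eol(text, pos):
    """Return the index just past the line ending that starts at text[pos].

    Assumes text[pos] is one of EOL_CHARS.
    """
    first = text[pos]
    pos += 1
    # Consume a second end-of-line character only if it DIFFERS from the first.
    if pos < len(text) and text[pos] in EOL_CHARS and text[pos] != first:
        pos += 1
    return pos


def split_lines(text):
    """Split text into lines (without their endings) using the rule above."""
    lines = []
    start = 0
    pos = 0
    while pos < len(text):
        if text[pos] in EOL_CHARS:
            lines.append(text[start:pos])
            pos = consume_eol(text, pos)
            start = pos
        else:
            pos += 1
    # Keep a final line that has no terminating end-of-line character.
    if start < len(text):
        lines.append(text[start:])
    return lines
```

With this sketch, split_lines("a\n\nb") sees two line endings (so there is an 
empty line between "a" and "b"), while "a\r\nb", "a\n\rb" and "a\r\u0085b" 
each contain a single line ending.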

One complication: This would recognize U+0085 as <next line>.  If the input 
contains data that read-char interprets as character U+0085, and it was not 
intended to be a next line, well, it'll be a next line now.  But this doesn't 
appear to be likely.  Users who use UTF-8 everywhere will have no issues, of 
course.  Many other encodings, such as Latin-1, will have no problem as well; 
byte values 0x8X are in the control character space for Latin-1 and, I 
believe, for all European encodings.  I think this is unlikely to be an issue, and the 
advantage is that even unusual line-ending encodings will "just work".
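
As a rough illustration of when a byte 0x85 actually turns into character 
U+0085, here is a small Python sketch.  The helper name contains_next_line is 
made up, and Python's built-in codecs stand in for whatever decoding the 
underlying Scheme's read-char really does, so treat it as an assumption-laden 
demonstration rather than a statement about any particular implementation:

```python
# Hypothetical helper; uses Python's codecs as a stand-in for read-char.
NEXT_LINE = "\u0085"


def contains_next_line(raw, encoding):
    """Report whether the bytes in raw decode to text containing U+0085."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding, so no character is produced.
        return False
    return NEXT_LINE in text


# In Latin-1, the single byte 0x85 is the control character U+0085.
latin1_hit = contains_next_line(b"a\x85b", "latin-1")

# In UTF-8, U+0085 is the two-byte sequence C2 85.
utf8_hit = contains_next_line(b"a\xc2\x85b", "utf-8")

# A byte 0x85 inside another UTF-8 sequence (here U+00C5, bytes C3 85)
# is not a next line.
utf8_other = contains_next_line("\u00c5".encode("utf-8"), "utf-8")
```

Here latin1_hit and utf8_hit are true and utf8_other is false, which matches 
the expectation that UTF-8 text only yields a next line when one was really 
encoded.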

