Hi Eli,

(I'm getting lost about where to reply, given how the subject gets mangled and the thread gets split into separate ones.)

I've thought about it a lot, experimented with Emacs's behavior, and I've arrived at the conclusion that we are actually much closer to each other than I had thought. Probably there's a lot of misunderstanding due to the different terminology we used.

I've set my terminal to RTL paragraph direction (via the relevant escape sequence), then did a "cat TUTORIAL.he" (the file taken from 26.1), and compared it to what I see in Emacs 25.2.2, both the graphical one and the one running in a terminal with no BiDi support. Apart from a few minor irrelevant differences, they look the same! Hooray!!!

(The differences are:

- I had to slightly modify TUTORIAL.he to make sure none of the lines start with a BiDi control (I added a preceding character), because currently VTE doesn't support them: there's no character cell to store this data. This definitely needs to be fixed in the second version of my proposal.

- Emacs running in a terminal shows an underscore wherever there's a BiDi control in the source file, while the graphical one doesn't. This looks like a simple bug to me, right?

- Line 1007, the copyright line of this file, uses visual indentation, and Emacs detects LTR paragraph direction for that line. I think it should rather use BiDi controls to have an overall RTL paragraph direction detected, and within that, BiDi controls to force LTR for the text. The terminal shows it with RTL direction, as I manually set it.

Again, all three of these details are irrelevant to my point, namely that in WIP gnome-terminal it looks the same as in Emacs.)

You define paragraphs as emptyline-separated blocks on which you perform autodetection of the paragraph direction. This is great! As I've mentioned, I'd love to have such a mode in terminals, but it's subject to underlying improvements, like knowing when a prompt starts and ends, because prompts also have to be paragraph delimiters. You convinced me that it's much more important than I thought, thanks a lot for that!
I will try to see if I can push for addressing the prerequisite issues sooner.

Indeed I had to manually set RTL paragraph direction; with manual LTR or with per-line autodetection (as VTE can do now) the result would be much worse.

Here's how the story continues from here. This is where we misunderstood each other (or at the very least I misunderstood you), although we are talking about the same thing and doing things the same way:

The BiDi algorithm takes a paragraph of text at a time, and somehow reshuffles its letters. UAX#9 section 3 starts by saying that the first main phase is separation into "paragraphs".

What are those "paragraphs" that we're talking about _now_? The thing is, both in Emacs and in my specification, it's a logical line of the text (that is: delimited by single newlines). In these steps, when the UBA is run, the paragraph is no longer defined as an emptyline-delimited segment; it's defined as a line of the text.

To recap: The _paragraph direction_ is determined in Emacs for emptyline-delimited segments of data, which I honestly find a great thing, and would love to do in terminals too; alas, at this point it's blocked by some really nontrivial technical issues. But once you have decided on a direction, each _line_ within that data is passed separately to the BiDi algorithm to get reshuffled; this is what Emacs does, this is what my specification says, and this is the right thing. That is, for this step, the definition of "paragraph", as the BiDi algorithm uses this term, is a line of the text file. This is where I thought we had a disagreement, but we don't, we just misunderstood each other.

-----

On a nitpicking side note:

It's damn ugly not to terminate a text file with a newline. Newline is much better thought of as a "terminator" than as a "delimiter". For example, if you do a "cat file1 file2", you expect file2 to start on its own line. Shouldn't this apply to paragraphs, too, especially when BiDi is in the game?
I'd argue that an empty line (double newline) shouldn't be a delimiter, it should be a terminator for a paragraph. I think "cat file1 file2" should make sure that the last paragraph of file1 and the first paragraph of file2 are printed as separate paragraphs (potentially with different paragraph directions), shouldn't it? I'd argue that if a text file is formatted like TUTORIAL.he, with empty lines denoting paragraph boundaries, then it should also end in an empty line (that is: two newline characters).

-----

Feel free to skip the rest :)

Let's make a thought experiment. Let's assume that for running the BiDi algorithm, we'd still stick to the emptyline-delimited paragraph definition. This is not what you do, this is not what I do, but I mistakenly believed that this was what you did, and I also thought this was a good idea as a potential extension for the BiDi specs. I no longer think so. This definition is truly problematic, as I'll show below.

The BiDi algorithm takes paragraphs of text, shuffles them, and somewhere in the middle, in cooperation with the caller, cuts them into lines. It doesn't say a single word about the input potentially already being cut into lines, how it would handle them, how they would interfere with the line breaks that the caller of the algorithm decides to add, etc. It makes sense: the BiDi algorithm converts a logical text into a visual one, whereas single newlines within a paragraph would already be visual elements, so the input string would be a mixture of the two worlds (which probably doesn't make any sense per se).

Let's assume that the message I want to deliver, written in its logical order (left to right), is:

    abc DEFGHIJKLM NOPQ rstuvwxyz

For whatever reason (e.g. I'd prefer to keep a 15 column margin in the source file) it's split into two lines, that is, in the middle there's a newline rather than a space:

    abc<space>DEFGHIJKLM<newline>NOPQ<space>rstuvwxyz

A completely non-BiDi application would show the contents as

    abc DEFGHIJKLM
    NOPQ rstuvwxyz

If you run the BiDi algorithm on this unit as a whole paragraph, it would not handle newline any differently from a space. It sees one continuous run of RTL text consisting of two words with a newline in between, and reverses their order:

    abc<space>QPON<newline>MLKJIHGFED<space>rstuvwxyz

Which would show up like this in a proper BiDi-aware viewer:

    abc QPON
    MLKJIHGFED rstuvwxyz

I can see two significant problems with this. One is that because it can shuffle characters around the newline, it breaks the principle that the eyes never have to move upwards. The second is that the margin of 15 characters is no longer preserved. The visual character (newline) no longer serves the visual purpose it served in the logical order.

Especially in terminals this could cause a whole bunch of troubles. E.g. when an application believes that printing some stuff moved the cursor down by 2 lines, it might have actually moved it by 3 (if the terminal's overall width is also 15-ish, in this example). It's unclear how cursor positions and mouse click positions (including on the "unused" area after the end of each line) could be mapped, and so on. It's such a complex area that I really wouldn't like to continue in this direction even if it was a correct one, which luckily it isn't.

(I vaguely recall, from about a decade ago, that browsers have a huge problem with "<br>" inside a paragraph when it comes to BiDi, presumably for reasons along these lines. I don't know where they stand now; I'll investigate if it's important, but I don't think it is.)
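A toy sketch of the thought experiment above, under loud assumptions: uppercase letters stand for RTL characters, lowercase for LTR, everything else is neutral, the base direction is LTR, and newline is deliberately treated like a space. This is not the UAX#9 algorithm; the function names are made up for illustration only. It contrasts reordering the whole block as one unit with reordering each line on its own.

```python
import re

# Toy model only (assumption): uppercase = RTL, lowercase = LTR, anything
# else (space, newline, ...) = neutral. Newline is NOT treated as a
# paragraph separator here, exactly as in the thought experiment.
# A run starts and ends with an RTL letter and contains no LTR letter.
_RTL_RUN = re.compile(r"[A-Z](?:[^a-z]*[A-Z])?")


def reorder_unit(text: str) -> str:
    """Reverse every maximal RTL run in `text`, assuming LTR base direction."""
    return _RTL_RUN.sub(lambda m: m.group(0)[::-1], text)


def reorder_whole(text: str) -> str:
    """Hypothetical variant: the whole block is one unit, newlines included."""
    return reorder_unit(text)


def reorder_per_line(text: str) -> str:
    """Each newline-delimited line is reordered as its own unit."""
    return "\n".join(reorder_unit(line) for line in text.split("\n"))


logical = "abc DEFGHIJKLM\nNOPQ rstuvwxyz"
print(reorder_whole(logical))
print("--")
print(reorder_per_line(logical))
```

With this toy model, the whole-block variant moves QPON up to the first line and pushes MLKJIHGFED down to the second, while the per-line variant keeps every word on the line where it was written.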
Luckily both Emacs and my specification shuffle the contents separately within each of the two lines (using LTR paragraph direction for both lines, as it's guessed from the union of them), resulting in the desired:

    abc MLKJIHGFED
    QPON rstuvwxyz

Does this all make much more sense now? :)

cheers,
egmont

On Tue, Feb 5, 2019 at 5:09 PM Eli Zaretskii via Unicode <[email protected]> wrote:
>
> > Date: Tue, 5 Feb 2019 00:05:47 +0000
> > From: Richard Wordingham via Unicode <[email protected]>
> >
> > > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > > > by paragraph separator characters. This means characters whose bidi
> > > > category is B, which includes Newline, the CR-LF pair on Windows,
> > > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
> >
> > It actually gives two different definitions. Table UAX#9 4 restricts
> > the type B to *appropriate newline functions; not all newlines are
> > paragraph separators.
>
> For what exactly is "appropriate newline function" one should read the
> Unicode Standard, section 5.8. My conclusions from that are different
> from yours; see below.
>
> > > Indeed, this was an oversight on my side. So, with this definition,
> > > every single newline character starts a new paragraph. The result of
> > > printf "Hello\nWorld\n" > world.txt
> > > is a text file consisting of two paragraphs, with 5 characters in
> > > each. Correct?
> >
> > No, it depends on when a newline function is 'appropriate'. TUS 5.8
> > Rule R2b applies - 'In simple text editors, interpret any NLF the same
> > as LS'.
>
> That's not all of what the Standard says. Just a couple of paragraphs
> above Rule R2b, there's this text:
>
>   Note that even if an implementer knows which characters represent
>   NLF on a particular platform, CR, LF, CRLF, and NEL should be
>   treated the same on input and in interpretation. Only on output is
>   it necessary to distinguish between them.
>
> So in practice, IMO the above example does constitute 2 paragraphs,
> regardless of the underlying platform's conventions.

