Hi Deri, At 2024-02-06T21:35:05+0000, Deri wrote: > Many thanks for your thoughts on my code. I shall reply in general > terms since your grasp of some of the issues is rather hazy, as you > admit.
I generally don't feel I grasp code of nontrivial complexity until I've documented it and written tests for it, and often not even then. I'm a bear of very little brain! > Huge AGL lookup table > > My least favourite solution, but you made me do it! The most elegant > and efficient solution was to make a one line amendment to afmtodit > which added an extra column to the groff font files which would have > the UTF-16 code for that glyph. This would only affect devpdf and > devps and I checked the library code groff uses to read its font files > was not affected by an extra column. I also checked the buffer used > would not cause an overflow. Despite this, you didn't like this > solution, without giving a cogent reason, but suggesting a lookup > table! I remember expressing unease with the "new column" approach, not rejection. The main reason is the documented format of the lines in question. groff_font(5): The directive charset starts the character set subsection. (On typesetters, this directive is misnamed since it starts a list of glyphs, not characters.) It precedes a series of glyph descriptions, one per line. Each such glyph description comprises a set of fields separated by spaces or tabs and organized as follows. name metrics type code [entity‐name] [-- comment] [...] The entity‐name field defines an identifier for the glyph that the postprocessor uses to print the troff glyph name. This field is optional; it was introduced so that the grohtml output driver could encode its character set. For example, the glyph \[Po] is represented by “£” in HTML 4.0. For efficiency, these data are now compiled directly into grohtml. grops uses the field to build sub‐encoding arrays for PostScript fonts containing more than 256 glyphs. Anything on the line after the entity‐name field or “--” is ignored. The presence of 2 adjacent optional fields seems to me fairly close to making the glyph descriptions formally undecidable. In practice, they're not, until and unless someone decides to name their "entity" "--"... (We don't actually tell anyone they're not allowed to do that.) As I understand it, this feature is largely a consequence of the implementation of grohtml 20-25 years ago, where an "entity" in HTML 4 and XHTML 1 was a well-defined thing. We might do well to tighten the semantics and format of this optional fifth field a bit more. More esteemed *roffers than I have stumbled over our documentation's unfortunate tendency to sometimes toss the term "entity" around, unmoored from any formal definition in the *roff language. https://lists.gnu.org/archive/html/groff/2023-04/msg00002.html While I'm complaining about hazy terminology that exacerbates my hazy understanding of things, I'll observe that I don't understand what the verb "to sub-encode" means. I suspect there are better words to express what this is trying to get across. If I understood what grops was actually doing here, I'd try to find those words. > As to whether I should embed the table, or read it in, I deferred to > the more efficient method use by afmtodit, embed it as part of make. I > still would prefer the extra column solution, then there is no lookup > at all. I don't object to the idea, but I think our design decisions should be documented, and it frequently seems to fall to me to undertake the documentation. That means I have to ask a lot of questions, which programmers sometimes interpret as critique. (And, to be fair, sometimes I actually _do_ have critiques of an implementation.) > use_charnames_in_special > > Probably unnecessary once you complete the work to return .device to > its 1.23.0 condition, as you have stated. That seems like a fair prediction. Almost all of the logic _on the formatter side_ that employs this parameter seems to be in one function, `encode_char()`. https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n5427 (Last month, I renamed that to `encode_char_for_troff_output()` and I'm thinking it can be further improved, more like `encode_char_for_device_control()`... ...there's just one more thing. There's one other occurrence, in a constructor. https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n293 I look forward to someday understanding what that's doing there.) > pdfmomclean > > Not quite sure how your work on #64484 will affect this, we will have > to wait and see. Fair enough. > Stringhex > > Clearly you are still misunderstanding the issue, because there are > some incorrect statements. Okay. > In any lookup there is a key/value pair. I'm with ya so far. > If dealing with a document written in Japanese, both the key and the > value will arrive as unicode. No problem for the value, but the key > will be invalid if used as part of a register name. Yes. I was trying to say the same thing in the mail to which you're replying. > There are two obvious solutions. One is to encode the key into > something, easily decoded, which is acceptable to be used as part of a > register name, or do a loop traversal over two arrays, one holding the > keys and one the values. Yes. > I'm pretty sure my 9yr old grandson would come up with a looping > solution. In my opinion, your grandson has good instincts if he avoids implementing things like "an:cln", "sanitize.tmac", or my own "an*abbreviate-page-topic". > I really don't understand your opposition to the encoding solution, If I'm debugging using troff and dump the string/macro list, then I envision it being disheartening to see something like this. .pm PDFLB 9 pdfswitchtopage 32 pdfnote 380 pdf:note-T 57 pdfpause 29 PDFBOOKMARK.VIEW 21 pdf:look(0073007500700065007200630061006c006900660072006100670069006c0069007300740069006300650078007000690061006c00690064006f00630069006f007500732602) 41 pdfmark 31 pdftransition 58 pdfbackground 40 pdfpagenumbering 37 pdfbookmark 1677 Whereas with the 9 year old youth's solution, I get something more like this. .pnr pdf:bm.nl 0 PDFOUTLINE.FOLDLEVEL 10000 PDFNOTE.WIDTH 252000 PDFNOTE.HEIGHT 144000 PDFHREF.VIEW.LEADING 5000 PDFHREF.LEADING 2000 pdf:look.id!1 1 .pm PDFLB 9 pdfswitchtopage 32 pdfnote 380 pdf:note-T 57 pdfpause 29 PDFBOOKMARK.VIEW 21 pdf:look.content!1 41 pdfmark 31 pdftransition 58 pdfbackground 40 pdfpagenumbering 37 pdfbookmark 1677 .tm \*[pdf:look.content!1] supercalifragilisticexpialidocious\[u2602] I think there are advantages to not having to read, copy and paste, or--God forbid, type--something like pdf:look(0073007500700065007200630061006c006900660072006100670069006c0069007300740069006300650078007000690061006c00690064006f00630069006f007500732602). Not to mention having to come up with an answer when someone asks me how to decode that at a shell prompt. > Ok, I accept you would have done it the childs way with the > performance hit, but I prefer the more elegant encoding solution. I agree that O(1) is generally better than O(n). But there are important things to measure besides the asymptotic runtime behavior of this one aspect of the system. In my experience, systems that are easier to troubleshoot are more pleasant to use than ones that aren't. Further, by not measuring the performance impact of encoding and decoding all those character-string to hexadecimal-character-string conversions, we're not telling the full story. > Uniqueness of keys is an issue for either strategy. In mom, a user > supplied key name is only possible by using the NAMED parameter, and > if a user uses the same name twice in the document nothing nasty will > happen, the overview panel will be correct, since each of those is > tagged with a safe generated name, and if they have used the same name > for two different places in the document, when they are checking all > the intra-document links they will find one of them will go to the > wrong place. Of course this could be improved by warning when the same > name is provided for a different destination. The man/mdoc macros > currently have no named destinations, all generated, but this will > change if the mdoc section referencing is implemented. Yes, though that's down the road a little way. First I just want to get hyperlinks working for `Mt`, `Lk`, and `Xr`. But it's an obvious concern for man pages. I'll _have_ to solve the problem because man page section heading titles are so rigidly prescribed. Format multiple man(7) (and mdoc(7)) documents and you'll have guaranteed collisions. Prefixing the anchors/bookmarks with the page identifier seems seems tractable-- <a name="groff(1)/Options"> --for instance. Even that's not perfect--with a large enough agglomeration of pages you could have more than one groff(1) page--but it might be close enough. For the pathological cases we can support a user-specified prefix string, maybe. > You mention a possible issue if a diversion is passed to stringhex, > since this is 95% your own code for stringup/down, I'm pretty sure > that whatever you do to solve the issue in your own code can be > equally applied to stringhex, so this not an argument you can use to > prevent its inclusion. I wasn't making an argument to "prevent its inclusion". I was and am studying and making observations, and if those point out oversights in my own work as well as others', that's a net benefit because it _gets the information out into the world_. The point of collaborative software development is to get synergetic benefits from multiple brains working on the same problem. I hope you have found my little brain to have contributed something from time to time. > As regards your point 2, this is a non-issue, in 1.23.0 it works fine > with .device. You ask what does:- > > \X'pdf: bizzarecmd \[u1234]' > > Mean? Well, assuming you are writing in the ethiopic language and > wrote:- > > \X'pdf: bizzarecmd ሴ' > > And gropdf would do a "bizzarecmd" using the CHARACTER given (ETHIOPIC > SYLLABLE SEE). Which could be setting a window title in the pdf > viewer, I'm not sure, I have not written a handler for bizzarecmd. That is the point I am trying to make. It is totally up to the postprocessor (or output driver) how to interpret "\[u1234]". > As you can see not "misleading to a novice" at all, the fact that > preconv changed it to be a different form and gropdf changed it back > to a character to use in pdf meta- data is completely transparent to > the user. I guess this is a matter of perspective, but the fact that preconv, troff, and gropdf are all separate programs is significant to me. In my opinion, that fact means that we need to document the interfaces between these components. The "\[uXXXX]" that a person puts in an input document can undergo transformation. (1) preconv might apply a Unicode normalization form to it; (2) the formatter might honor user instructions to translate or replace the special character, and, in the future, it might participate in ligature replacement; (3) the output device will of course do whatever it does to make a visible grapheme. > Your work on \X and .device is to put .device back to how it was in > 1.23.0 and alter \X to be the same, this is what you said would > happen. Yup. That's still my plan. It's a bit of a slog to get through all of the corner cases. In case it is of interest, I'm attaching two pieces of absolutely not fully baked code from my Git stash and a working copy, respectively, to illustrate my progress. Perhaps you would like the opportunity to repay me in code review coin. I present my underbelly at its softest! :-O Having gotten through all of that, I find myself unable to locate where you explicitly identified any of my statements about stringhex as incorrect, which is an expectation you set up. We agree that one approach is O(1) and another is O(n). We agree that the key uniqueness problem is unsolved by either one. You think the solution that I lean toward is childish, and I think the solution you lean toward is less helpful for people using GNU troff(1) to debug documents or macro files. But those are matters of opinion, not correctness of claims. Please more explicitly flag my mistaken/false statements so that I can either rebut your characterization, or learn something from you. > The purpose of my patch was intended to give Robin a robust solution > to what he wanted to do. Yes. That's fine--I simply wanted to take the opportunity to try and keep things moving forward with the gropdf-ng merge, a long and difficult process. And if he's happy, I'm happy--see below regarding my trust in you to get the problem solved. I hope you haven't overlooked that I was thrilled with your recent commits that make the PDF-hyperlinkification of the collected groff man pages possible. That was so compelling that I put other things on hold to get Savannah #61434 done. <https://savannah.gnu.org/bugs/?61434> > You wrote in another email:- > > "But tparm(const char *str, long, long, long, long, long, long, long, > long, long) is one of the worst things I've ever seen in C code. > > As I just got done saying (more or less) to Deri, when you have to > obfuscate your inputs to cram them into the data structure you're > using, that's a sign that you're using the wrong data structure." > > I don't appreciate being conflated with "the worst things". I wasn't attempting to liken anything you've written to the tparm thing. But even if I were, that would be an assessment of _code_ rather than of your quality as a person or as a programmer. I probably sound like a stereotypical political liberal when I claim that "bad" code comes from bad environments, not bad programmers. In my opinion that really is the first order factor. People are trainable; programmers improve as they practice. "Bad" code is often the result of a poor fit for the tool with the problem, and it's sadly typical that a programmer does not have the breadth of choice in tools that they would prefer. They might not have learned of a well-suited alternative; no one has mastered everything. They may be working under a management mandate to employ certain technology, or operating under constraints that render "better" tooling unsupportable in the environment in question. The pressures of cost minimization can knock things out of reach in the product phase that were easily sustained during research and development. Similarly, the *roff language imposes constraints on us. I suspect you and I both would be reaching for Perl-like associative arrays to handle the bookmark storage and lookup issue if *roff had them to offer us. We should always be on the lookout for better solutions. Languages, tools, and expertise evolve. This is what refactoring is all about. Furthermore, we (or at least I) learn more about code by taking a specimen that works and re-expressing it than by virtually any other method. When we have a good automated test suite to establish that we haven't regressed it, this can be "refactoring" at its best. If the end result is smaller, clearer, and/or more performant code, it's a win. > The structure we were discussing was a simple key/value pair. Yes, but not so simple in *roff. *roff identifiers are so "liberal in what they expect" that people get seduced into thinking you can put anything into them, which leads to episodes like Savannah #64202. > I have noticed that you tend to call things obfuscated when you have > difficulty understanding them. Once again, I confess that I am a bear of very little brain. But I also figure that even if I'm roughly as sharp as the average person on this list--if I may make such a boast even hypothetically-- then plenty of other people will have at least as much difficulty as I. > Encoding a key into a well known address space (the hex numbers) is > not obfuscation. V pbaprqr gung lbh znl or noyr gb ernq n urknqrpvzny rapbqvat bs NFPVV jvgu sne terngre rnfr guna V pna. Ohg vs abg, gura V fhttrfg gung lbh erpbafvqre lbhe pynvz urer. If you're old enough to have grandchildren I trust that you recognize that form of non-obfuscation immediately. And USENET still lives... > Consider BASE64 in mail systems, is that obfuscation or a valid method > of protecting an essentially ascii environment, It was an expedient for transmitting 8-bit data payloads across non-8-bit clean data communications channels. Would Base64 or UU encoding ever have been developed if 8-bit clean channels had been ubiquitous in the first place? I think you're inferring value judgments from my observations that you're not warranted in making. The bookmark-encoding-in-an-identifier problem sucks, that's for sure. I don't blame anyone for doing what they can to escape its constraints. But I think it's worth exploring the space of solutions available to us. Recall that Base64 has costs in time and space. And of course, most people can't just read a Base64-encoded email as-is, that is by, say, catting or grepping their inbox. These are downsides we should be forthright about. > to my mind its a sensible compromise in systems designed before > unicode (rings a bell?). I'm not passionate about Base64 one way or the other. If the day arrives when we no longer need it for anything because the problems that motivated its development disappear, then I wouldn't mourn its demise. It's an instrument, as stringhex, or some other solution to the same problem, would be. > I'm afraid you give the impression that your ideas on how I should do > my voluntary contribution to groff have more weight than my own, is > that how you see it? Not at all. We've had conflict over this point before, so I will ask again whether you would prefer that gropdf be in the contrib/ directory. Peter Schaffter, for instance, has made a simple statement of where he regards the boundaries of mom to be even within the contrib/mom directory; he's not terribly concerned with, as I recall, asserting control over the Automake script or the groff_mom(7) man page, both of which originated with others in the first place. So, I pretty much keep my fingers out of om.tmac. (I think I've run some one-liner changes, at about the level of typo fixes, by him on one or two occasions, in response to Savannah tickets. I simply don't grasp the package well enough to do more.) ...but, Peter is also wholly responsible for mom's documentation. If you leave to others the task of writing the documentation for your contributions, they are (I am) going to study them as closely as they (I) need to to try to communicate their function to third parties. Specifically, when I study them closely enough, I start thinking like an engineer again instead of just a technical writer. And I don't think that's unwarranted since I serve both roles in this project. Further, if a code module lives in the non-contrib part of the groff source distribution, I think the strong implication is that responsibility for its maintenance falls to the groff team as a whole. (I'll grant that the distinction is less clear than it could be, because groff has relatively few active contributors and a lot of the stuff in the contrib/ directory is maintained either by me or by no one at all, because the original authors have wandered away over the past 20 years.) > I welcome people who can find issues with my code, and by issues I > mean if it produces output other than intended, fails on edge cases I > have not considered, falls over given valid input. Certainly. If you didn't welcome that sort of feedback I think you'd be neglecting your responsibilities as a software developer. Fortunately that scenario is counterfactual. > I am quite sure there will be "bugs" in my code, it is fairly complex, > but subjecting it to a "code review" without even running it to see if > it does what it says on the box, is not helpful. I think you've pretty badly mistaken my perspective. One of the reasons I stick my long nose into your code in this way is because I don't worry that you won't produce correct results. You have an established record of delivering solutions that work as advertised. (None of us is perfect, but you do as well as anyone, as far as I can tell. It wouldn't surprise me in the least if your defect rate were lower than mine. I depend heavily on iterative development processes and automated testing to protect myself from faceplants, and I still suffer those occasionally. Here's a recent example <https://savannah.gnu.org/bugs/?65225>. We/I didn't have a regression test for tbl's '\R' feature, and I got bitten. I overlooked the forest for the wood lice of invalid inputs. We have that regression test now.) Regards, Branden
diff --git a/src/roff/groff/tests/device-control-special-character-handling.sh b/src/roff/groff/tests/device-control-special-character-handling.sh index d5f37c1d9..f5b10a393 100755 --- a/src/roff/groff/tests/device-control-special-character-handling.sh +++ b/src/roff/groff/tests/device-control-special-character-handling.sh @@ -19,6 +19,7 @@ # groff="${abs_top_builddir:-.}/test-groff" +troff="${abs_top_builddir:-.}/troff" fail= @@ -28,69 +29,84 @@ wail () { } input='.nf -\X#bogus1: esc \%man-beast\[u1F63C]\[u1F00] -\[aq]\[dq]\[ga]\[ha]\[rs]\[ti]# -.device bogus1: req \%man-beast\[u1F63C]\[u1F00] -'"'"'"`^\\~ -.device "wack1: req \%man-beast\[u1F63C]\[u1F00] -'"'"'"`^\\~ -.device " altwack -.\" We must break or flush here before changing the escape character. -.fl -.ec @ -@X#bogus2: esc @%man-beast@[u1F63C]@[u1F00] -@[aq]@[dq]@[ga]@[ha]@[rs]@[ti]# -.device bogus2: req @%man-beast@[u1F63C]@[u1F00] -'"'"'"`^\~ -.device "wack2: req @%man-beast@[u1F63C]@[u1F00] -'"'"'"`^\~' - -output=$(printf '%s\n' "$input" | "$groff" -T ps -Z 2> /dev/null \ +.device request 1: !"#$%&'"'"'()*+,-./0123456789:;<=>? + \X@ escape 1: !"#$%&'"'"'()*+,-./0123456789:;<=>?@ +.device request 2: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ + \X! escape 2: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_! +.device request 3: `abcdefghijklmnopqrstuvwxyZ{|}~ + \X! escape 3: `abcdefghijklmnopqrstuvwxyZ{|}~!' + +output=$(printf '%s\n' "$input" | "$troff" -R -T ps 2> /dev/null \ | grep '^x X') -error=$(printf '%s\n' "$input" | "$groff" -T ps -Z 2>&1 > /dev/null) +error=$(printf '%s\n' "$input" | "$troff" -R -T ps 2>&1 > /dev/null) + +echo "checking printable ASCII repertoire" >&2 echo "$output" echo "$error" -echo "checking X escape sequence, default escape character" >&2 -# x X bogus1: esc man-beast -'"`^\~ echo "$output" \ - | grep -Fqx 'x X bogus1: esc man-beast -'"'"'"`^\~' \ + | grep -Fqx 'x X request 1: !"#$%&'"'"'()*+,-./0123456789:;<=>?' \ || wail - -echo "checking device request, default escape character" >&2 -# x X bogus1: req man-beast\[u1F63C]\[u1F00] -'"`^\~ echo "$output" \ - | grep -Fqx 'x X bogus1: req man-beast\[u1F63C]\[u1F00] -'"'"'"`^\~' \ + | grep -Fqx 'x X escape 1: !"#$%&'"'"'()*+,-./0123456789:;<=>?' \ || wail - -echo "checking same, but with leading quote" >&2 -# x X bogus1: req man-beast\[u1F63C]\[u1F00] -'"`^\~ echo "$output" \ - | grep -Fqx 'x X wack1: req man-beast\[u1F63C]\[u1F00] -'"'"'"`^\~' \ + | grep -Fqx 'x X request 2: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_' \ || wail - -echo "checking leading quote and space behavior in device request" >&2 -# x X altwack -# 2 spaces between X and altwack -echo "$output" | grep -Fqx 'x X altwack' || wail - -echo "checking X escape sequence, alternate escape character" >&2 -# x X bogus2: esc man-beast -'"`^\~ echo "$output" \ - | grep -Fqx 'x X bogus2: esc man-beast -'"'"'"`^\~' \ + | grep -Fqx 'x X escape 2: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_' \ || wail - -echo "checking device request, alternate escape character" >&2 -# x X bogus2: req man-beast\[u1F63C]\[u1F00] -'"`^\~ echo "$output" \ - | grep -Fqx 'x X bogus2: req man-beast\[u1F63C]\[u1F00] -'"'"'"`^\~' \ + | grep -Fqx 'x X request 3: `abcdefghijklmnopqrstuvwxyZ{|}~' \ || wail - -echo "checking same, but with leading quote" >&2 -# x X bogus1: req man-beast\[u1F63C]\[u1F00] -'"`^\~ echo "$output" \ - | grep -Fqx 'x X wack2: req man-beast\[u1F63C]\[u1F00] -'"'"'"`^\~' \ + | grep -Fqx 'x X escape 3: `abcdefghijklmnopqrstuvwxyZ{|}~' \ || wail -# TODO: Uncomment this when \X learns to read its argument in copy mode. -#echo "checking that no error diagnostics were thrown" >&2 -#test -z "$error" || wail +# Confirm translation of a groff special character escape sequence to a +# basic Latin character when used in a device control command. +# +# $1 is the special character escape _without_ the leading backslash. +# $2 is the expected output character _shell-quoted as necessary_. +# $3 is a human-readable glyph description for the test log. +check_char () { + sc=$1 + output=$2 + format='checking conversion of "\\%s" to "%s" in device control' + format="$format escape sequence\n" + printf "$format" "$sc" "$output" "$description" >&2 + if ! printf '\\X#\\%s %s#\n' "$sc" "$desc" | "$troff" -R -T ps \ + | grep -Fqx 'x X '$output + then + printf '...FAILED' >&2 + fail=yes + fi + + format='checking conversion of "\\%s" to "%s" in device control' + format="$format request\n" + printf "$format" "$sc" "$output" "$description" >&2 + if ! printf '.device %s %s\n' "$sc" "$desc" | "$troff" -R -T ps \ + | grep -Fqx 'x X '$output + then + printf '...FAILED' >&2 + fail=yes + fi +} + +check_char ' ' ' ' +#check_char - - +#check_char '[aq]' "'" "neutral apostrophe" +#check_char '[dq]' '"' "double quote" +#check_char '[ga]' '`' "grave accent" +#check_char '[ha]' ^ "caret/hat" +#check_char '[rs]' '\' "reverse solidus/backslash" +#check_char '[ti]' '~' "tilde" test -z "$fail" +## TODO: Uncomment this when \X learns to read its argument in copy mode. +##echo "checking that no error diagnostics were thrown" >&2 +##test -z "$error" || wail + # vim:set autoindent expandtab shiftwidth=2 tabstop=2 textwidth=72: diff --git a/src/roff/groff/tests/some_escapes_accept_newline_delimiters.sh b/src/roff/groff/tests/some_escapes_accept_newline_delimiters.sh index 96be05532..40ce37da3 100755 --- a/src/roff/groff/tests/some_escapes_accept_newline_delimiters.sh +++ b/src/roff/groff/tests/some_escapes_accept_newline_delimiters.sh @@ -1,6 +1,6 @@ #!/bin/sh # -# Copyright (C) 2022 Free Software Foundation, Inc. +# Copyright (C) 2022-2024 Free Software Foundation, Inc. # # This file is part of groff. # @@ -88,19 +88,6 @@ echo "checking correct handling of newline delimiter to 'w' escape" >&2 output=$(printf "%s\n" "$input" | "$groff" -Tascii -ww) test "$output" = "72 D" || wail -input="\X -tty: link http://example.com -D -.pl \n(nlu" - -echo "checking that newline is accepted as delimiter to 'X' escape" >&2 -error=$(printf "%s\n" "$input" | "$groff" -Tascii -ww -z 2>&1) -test -z "$error" || wail - -echo "checking correct handling of newline delimiter to 'X' escape" >&2 -output=$(printf "%s\n" "$input" | "$groff" -Tascii -ww -P -c) -test "$output" = ' D' || wail - input="\Z ABC D diff --git a/src/roff/troff/input.cpp b/src/roff/troff/input.cpp index bef0dcabf..0e8e50750 100644 --- a/src/roff/troff/input.cpp +++ b/src/roff/troff/input.cpp @@ -917,10 +917,13 @@ void shift() skip_line(); } -static char get_char_for_escape_parameter(bool allow_space = false) +static char get_char_for_escape_parameter(bool allow_space = false, + bool gbr_flag = false) { - int c = get_copy(0 /* nullptr */, false /* is defining */, + int c = get_copy(0 /* nullptr */, + false /* is defining */, true /* handle \E */); + debug("GBR: processing '%1' (c=%2)", char(c), c); switch (c) { case EOF: copy_mode_error("end of input in escape sequence"); @@ -929,19 +932,27 @@ static char get_char_for_escape_parameter(bool allow_space = false) if (!is_invalid_input_char(c)) break; // fall through - case '\n': - if (c == '\n') - input_stack::push(make_temp_iterator("\n")); +//case '\n': +// input_stack::push(make_temp_iterator("\n")); +// // fall through + case ESCAPE_BANG: + if (gbr_flag) { + //debug("GBR: ignoring an escaped '!'"); + c = ' '; + } // fall through case ' ': - if (c == ' ' && allow_space) + if (allow_space) { + //debug("GBR: we saw a space (or something) but we're allowing it"); break; + } // fall through case '\t': case '\001': case '\b': - copy_mode_error("%1 is not allowed in an escape sequence parameter", - input_char_description(c)); + //copy_mode_error("GBR: %1 is not allowed in an escape sequence" + // " parameter", input_char_description(c)); + /* silently discard */; return '\0'; } return c; @@ -2475,43 +2486,55 @@ int token::operator!=(const token &t) return !(*this == t); } -// is token a suitable delimiter (like ')? +// Is the character usable as a delimiter? +// +// This is used directly only by `do_device_control()`, because it is +// the only escape sequence that reads its argument in copy mode (so it +// doesn't tokenize it) and accepts a user-specified delimiter. +static bool is_char_usable_as_delimiter(int c) +{ + switch(c) { + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + case '+': + case '-': + case '/': + case '*': + case '%': + case '<': + case '>': + case '=': + case '&': + case ':': + case '(': + case ')': + case '.': + return false; + default: + return true; + } +} + +// Is the current token a suitable delimiter (like `'`)? bool token::is_usable_as_delimiter(bool report_error) { + bool is_valid = false; switch(type) { case TOKEN_CHAR: - switch(c) { - case '0': - case '1': - case '2': - case '3': - case '4': - case '5': - case '6': - case '7': - case '8': - case '9': - case '+': - case '-': - case '/': - case '*': - case '%': - case '<': - case '>': - case '=': - case '&': - case ':': - case '(': - case ')': - case '.': - if (report_error) - error("character '%1' is not allowed as a starting delimiter", - char(c)); - return false; - default: - return true; - } + is_valid = is_char_usable_as_delimiter(c); + if (!is_valid && report_error) + error("character '%1' is not allowed as a starting delimiter", + char(c)); + return is_valid; case TOKEN_NODE: // the user doesn't know what a node is if (report_error) @@ -5629,65 +5652,34 @@ static void encode_char_for_troff_output(macro *mac, const char c) mac->append('\\'); } else { - if (tok.is_stretchable_space() - || tok.is_unstretchable_space()) - mac->append(' '); - else if (tok.is_special()) { - const char *sc; - if (font::use_charnames_in_special) { - charinfo *ci = tok.get_char(true /* required */); - sc = ci->get_symbol()->contents(); - } - else - sc = tok.get_char()->get_symbol()->contents(); - if (strcmp("-", sc) == 0) - mac->append('-'); - else if (strcmp("aq", sc) == 0) - mac->append('\''); - else if (strcmp("dq", sc) == 0) - mac->append('"'); - else if (strcmp("ga", sc) == 0) - mac->append('`'); - else if (strcmp("ha", sc) == 0) - mac->append('^'); - else if (strcmp("rs", sc) == 0) - mac->append('\\'); - else if (strcmp("ti", sc) == 0) - mac->append('~'); - else { - if (font::use_charnames_in_special) { - if (sc[0] != (char)0) { - mac->append('\\'); - mac->append('['); - int i = 0; - while (sc[i] != (char)0) { - mac->append(sc[i]); - i++; - } - mac->append(']'); - } - else - error("special character '%1' cannot be used within a" - " device control escape sequence", sc); - } - else - error("special character '%1' cannot be used within a device" - " control escape sequence", sc); - } - } - else if ((tok.is_hyphen_indicator() - || tok.is_space() - || tok.is_dummy() - || tok.is_transparent_dummy() - || tok.is_zero_width_break())) { - ; // silently discard this escape sequence - } - else - error("%1 is invalid within a device control command", - tok.description()); + error("char %1 (c=%2) is invalid within a device control command", + char(c), c); + } +} + +static node *do_device_control() // \X +{ + int c; + macro mac; + int delim = get_copy(0 /* nullptr */); + debug("GBR: delimiter is %1", delim); + while (((c = get_copy(0 /* nullptr */)) != delim) && c != EOF && c != '\n') { + debug("GBR: processing '%1' (c=%2)", char(c), c); + encode_char_for_troff_output(&mac, c); + if (0 == c) + error("GBR: DAMN, need a node handler"); + } + if (c == EOF || c == '\n') + error("unterminated device control escape sequence"); + if (mac.empty()) { + error("device control escape sequence requires an argument"); + return 0 /* nullptr */; } + else + return new special_node(mac); } +#if 0 static node *do_device_control() { int start_level = input_stack::get_level(); @@ -5725,6 +5717,7 @@ static node *do_device_control() } return new special_node(mac); } +#endif static void device_request() { diff --git a/tmac/an.tmac b/tmac/an.tmac index a60eb888f..09ad61484 100644 --- a/tmac/an.tmac +++ b/tmac/an.tmac @@ -1215,7 +1215,9 @@ .de1 UE .de1 MR . if ((\\n[.$] < 2) : (\\n[.$] > 3)) \ . an-style-warn .\\$0 expects 2 or 3 arguments, got \\n[.$] +. tm GBR1: MR \\$@ . ds an*url man:\\$1(\\$2)\" used everywhere but macOS +. tm GBR2: an*url \\*[an*url] . if (\\n[an*MR-URL-format] = 2) \ . ds an*url x-man-page://\\$2/\\$1\" macOS/Mac OS X since 10.3 . if (\\n[an*MR-URL-format] = 3) \ diff --git a/tmac/pspic.tmac b/tmac/pspic.tmac index 98f9ba4f1..456e32c88 100644 --- a/tmac/pspic.tmac +++ b/tmac/pspic.tmac @@ -129,8 +129,7 @@ .de PSPIC . \" between those two escapes . ds ps-invis \X'ps: invis' . ds ps-endinvis \X'ps: endinvis' -. ds ps-import \X'ps: import \E$1 \En[llx] \En[lly] \En[urx] \En[ury] \ - \En[ps-deswid] \E*[ps-desht]' +. ds ps-import \X'ps: import \\$1 \\n[llx] \\n[lly] \\n[urx] \\n[ury] \\n[ps-deswid] \\*[ps-desht]' . \} . el \{\ . ds ps-invis
--- src/roff/troff/input.cpp 2024-02-04 02:14:00.583956798 -0600 +++ ./ATTIC/input.cpp.half-baked 2024-01-23 12:05:55.242368237 -0600 @@ -399,7 +399,7 @@ : fp(f), lineno(1), filename(fn), popened(po), newline_flag(0), seen_escape(0) { - if ((font::use_charnames_in_special) && (fn != 0 /* nullptr */)) { + if ((font::use_charnames_in_special) && (fn != 0)) { if (!the_output) init_output(); the_output->put_filename(fn, po); @@ -917,9 +917,11 @@ skip_line(); } -static char get_char_for_escape_parameter(bool allow_space = false) +static char get_char_for_escape_parameter(bool allow_space = false, + bool want_discard = false) { - int c = get_copy(0 /* nullptr */, false /* is defining */, + int c = get_copy(0 /* nullptr */, + false /* is defining */, true /* handle \E */); switch (c) { case EOF: @@ -940,8 +942,9 @@ case '\t': case '\001': case '\b': - copy_mode_error("%1 is not allowed in an escape sequence parameter", - input_char_description(c)); + if (!want_discard) + copy_mode_error("%1 is not allowed in an escape sequence" + " parameter", input_char_description(c)); return '\0'; } return c; @@ -1279,9 +1282,9 @@ return false; } -non_interpreted_char_node::non_interpreted_char_node(unsigned char cc) : c(cc) +non_interpreted_char_node::non_interpreted_char_node(unsigned char n) : c(n) { - assert(cc != 0); + assert(n != 0); } node *non_interpreted_char_node::copy() @@ -1297,7 +1300,7 @@ static void do_width(); static node *do_non_interpreted(); -static node *do_special(); +static node *do_device_control(); static node *do_suppress(symbol nm); static void do_register(); @@ -1879,7 +1882,7 @@ } units x; for (;;) { - node *n = 0 /* nullptr */; + node *n = 0; int cc = input_stack::get(&n); if (cc != escape_char || escape_char == 0) { handle_ordinary_char: @@ -2039,7 +2042,7 @@ return; case 0: { - assert(n != 0 /* nullptr */); + assert(n != 0); token_node *tn = n->get_token_node(); if (tn) { *this = tn->tk; @@ -2360,7 +2363,7 @@ nd = new extra_size_node(x); return; case 'X': - nd = do_special(); + nd = do_device_control(); if (!nd) break; type = TOKEN_NODE; @@ -2533,7 +2536,7 @@ const char *token::description() { - const size_t bufsz = sizeof "character 'x'" + 1; + const size_t bufsz = strlen("character 'x'") + 1; static char buf[bufsz]; (void) memset(buf, 0, bufsz); switch (type) { @@ -3209,7 +3212,7 @@ } if (!suppress_next) tok.next(); - was_trap_sprung = false; + trap_sprung_flag = 0; } } @@ -3362,33 +3365,33 @@ void node_list::append(node *n) { - if (head == 0 /* nullptr */) { - n->next = 0 /* nullptr */; + if (head == 0) { + n->next = 0; head = tail = n; } else { - n->next = 0 /* nullptr */; + n->next = 0; tail = tail->next = n; } } int node_list::length() { - int total = 0 /* nullptr */; - for (node *n = head; n != 0 /* nullptr */; n = n->next) + int total = 0; + for (node *n = head; n != 0; n = n->next) ++total; return total; } node_list::node_list() { - head = tail = 0 /* nullptr */; + head = tail = 0; } node *node_list::extract() { node *temp = head; - head = tail = 0 /* nullptr */; + head = tail = 0; return temp; } @@ -3408,7 +3411,7 @@ macro::~macro() { - if (p != 0 /* nullptr */ && --(p->count) <= 0) + if (p != 0 && --(p->count) <= 0) delete p; } @@ -3416,12 +3419,12 @@ : is_a_diversion(0), is_a_string(1) { if (!input_stack::get_location(1, &filename, &lineno)) { - filename = 0 /* nullptr */; - lineno = 0 /* nullptr */; + filename = 0; + lineno = 0; } len = 0; empty_macro = 1; - p = 0; /* nullptr */ + p = 0; } macro::macro(const macro &m) @@ -3429,7 +3432,7 @@ empty_macro(m.empty_macro), is_a_diversion(m.is_a_diversion), is_a_string(m.is_a_string), p(m.p) { - if (p != 0 /* nullptr */) + if (p != 0) p->count++; } @@ -3437,13 +3440,13 @@ : is_a_diversion(is_div) { if (!input_stack::get_location(1, &filename, &lineno)) { - filename = 0 /* nullptr */; - lineno = 0 /* nullptr */; + filename = 0; + lineno = 0; } len = 0; empty_macro = 1; is_a_string = 1; - p = 0 /* nullptr */; + p = 0; } int macro::is_diversion() @@ -3464,9 +3467,9 @@ macro ¯o::operator=(const macro &m) { // don't assign object - if (m.p != 0 /* nullptr */) + if (m.p != 0) m.p->count++; - if (p != 0 /* nullptr */ && --(p->count) <= 0) + if (p != 0 && --(p->count) <= 0) delete p; p = m.p; filename = m.filename; @@ -3481,7 +3484,7 @@ void macro::append(unsigned char c) { assert(c != 0); - if (p == 0 /* nullptr */) + if (p == 0) p = new macro_header; if (p->cl.length() != len) { macro_header *tem = p->copy(len); @@ -3497,14 +3500,14 @@ void macro::set(unsigned char c, int offset) { - assert(p != 0 /* nullptr */); + assert(p != 0); assert(c != 0); p->cl.set(c, offset); } unsigned char macro::get(int offset) { - assert(p != 0 /* nullptr */); + assert(p != 0); return p->cl.get(offset); } @@ -3527,8 +3530,8 @@ void macro::append(node *n) { - assert(n != 0 /* nullptr */); - if (p == 0 /* nullptr */) + assert(n != 0); + if (p == 0) p = new macro_header; if (p->cl.length() != len) { macro_header *tem = p->copy(len); @@ -4229,7 +4232,7 @@ // point with a composite mapping? Either the key or value component // of an entry in the composite dictionary qualifies. // -// This is an O(n) search, but by default groff defines only 22 +// This is an O(n) search, but by default groff only defines 22 // composite character mappings ("tmac/composite.tmac"). If this // becomes a performance problem, we will need another dictionary // mapping the unique values of `composite_dictionary` (which is not @@ -4272,15 +4275,15 @@ skip_line(); } -bool was_trap_sprung = false; -static bool are_traps_postponed = false; +int trap_sprung_flag = 0; +int postpone_traps_flag = 0; symbol postponed_trap; void spring_trap(symbol nm) { assert(!nm.is_null()); - was_trap_sprung = true; - if (are_traps_postponed) { + trap_sprung_flag = 1; + if (postpone_traps_flag) { postponed_trap = nm; return; } @@ -4301,19 +4304,19 @@ void postpone_traps() { - are_traps_postponed = true; + postpone_traps_flag = 1; } -bool unpostpone_traps() +int unpostpone_traps() { - are_traps_postponed = false; + postpone_traps_flag = 0; if (!postponed_trap.is_null()) { spring_trap(postponed_trap); postponed_trap = NULL_SYMBOL; - return true; + return 1; } else - return false; + return 0; } void read_request() @@ -4398,7 +4401,7 @@ c = get_copy(&n); macro mac; request_or_macro *rm = (request_or_macro *)request_dictionary.lookup(nm); - macro *mm = rm ? rm->to_macro() : 0 /* nullptr */; + macro *mm = rm ? rm->to_macro() : 0; if (mode == DEFINE_APPEND && mm) mac = *mm; if (comp == COMP_DISABLE) @@ -5616,19 +5619,24 @@ return new non_interpreted_node(mac); } -// In troff output, we translate the escape character to '\', but it is -// up to the postprocessor to interpret it as such. (This mostly -// matters for device control commands.) static void encode_char_for_troff_output(macro *mac, const char c) { - bool is_char_valid = true; - const char *sc = 0 /* nullptr */; - if ('\0' == c) { - if (tok.is_space() - || tok.is_stretchable_space() + // All printable characters get encoded as themselves except the + // escape character. The formatter language's choice of escape + // character is irrelevant to the device-independent output language. + // Encode that as a backslash. + if (csprint(c)) { + if (c != escape_char) + mac->append(c); + else + mac->append('\\'); + } + else { + if (tok.is_stretchable_space() || tok.is_unstretchable_space()) mac->append(' '); else if (tok.is_special()) { + const char *sc; if (font::use_charnames_in_special) { charinfo *ci = tok.get_char(true /* required */); sc = ci->get_symbol()->contents(); @@ -5662,38 +5670,151 @@ mac->append(']'); } else - is_char_valid = false; + error("special character '%1' cannot be used within a" + " device control escape sequence", sc); } else - is_char_valid = false; + error("special character '%1' cannot be used within a device" + " control escape sequence", sc); } } - else if (tok.is_hyphen_indicator() - || tok.is_dummy() - || tok.is_transparent_dummy() - || tok.is_zero_width_break()) - /* silently ignore */; - else - is_char_valid = false; - if (!is_char_valid) { - if (sc != 0 /* nullptr */) - error("special character '%1' is invalid within a device" - " control command", sc); - else - error("%1 is invalid within a device control command", - tok.description()); + else if ((tok.is_hyphen_indicator() + || tok.is_space() + || tok.is_dummy() + || tok.is_transparent_dummy() + || tok.is_zero_width_break())) { + ; // silently discard this escape sequence } + else + error("%1 is invalid within a device control command", + tok.description()); + } +} + +static node *do_device_control() +{ + node *n; + int c; + macro mac; + token start_token; + start_token.next(); + if (!start_token.is_usable_as_delimiter(true /* report error */)){ + do { + tok.next(); + } while (tok != start_token && !tok.is_newline() && !tok.is_eof()); } + if (tok == start_token) + error("device control escape sequence requires an argument"); else { - if (c == escape_char) { - mac->append('\\'); + // Get a character, accepting spaces and discarding embedded escape + // sequences with no representation outside of the formatter. + while ((c = get_char_for_escape_parameter(true, true)) + != start_token.ch() && c != EOF && c != '\n') { + debug("GBR: encoding '%1' (%2) in device control command", char(c), c); +// while ((c = get_copy(&n)) != start_token.ch() && c != EOF && c != '\n') { + if ('\0' == c) { + debug("GBR: ignoring it"); + continue; + } + else + encode_char_for_troff_output(&mac, c); + // XXX: we can kill everything from here forward? + if ((c > 31) && (c < 127)) { + ;// + } + // Convert certain tokens back into characters. + else if (c == 1) { // same in ISO 646 and CCSID 1047 + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, 'a'); + } + else if (c == ESCAPE_BANG) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '!'); + } + else if (c == ESCAPE_SPACE) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, ' '); + } + else if (c == ESCAPE_AMPERSAND) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '&'); + } + else if (c == ESCAPE_COLON) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, ':'); + } + else if (c == ESCAPE_CIRCUMFLEX) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '^'); + } + else if (c == ESCAPE_UNDERSCORE) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '_'); + } + else if (c == ESCAPE_LEFT_BRACE) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '{'); + } + else if (c == ESCAPE_RIGHT_BRACE) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '}'); + } + else if (c == ESCAPE_BAR) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '|'); + } + else if (c == ESCAPE_RIGHT_QUOTE) + encode_char_for_troff_output(&mac, '\''); + else if (c == ESCAPE_HYPHEN) + encode_char_for_troff_output(&mac, '-'); + else if (c == ESCAPE_PERCENT) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '%'); + } + else if (c == ESCAPE_TILDE) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, '~'); + } + else if (c == ESCAPE_E) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, 'E'); + } + else if (c == ESCAPE_c) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, 'c'); + } + else if (c == ESCAPE_e) { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, 'e'); + } + else if (c == '\t') { + encode_char_for_troff_output(&mac, '\\'); + encode_char_for_troff_output(&mac, 't'); + } + else if (c == INPUT_DELETE) + error("a delete character is unsupported in a device control" + " escape sequence"); + else if (c == 0) { + assert(0 == "wuh-oh"); + //debug("GBR: appending node to device control command"); + mac.append(n); + } + else { + debug("GBR: unhandled c=%1 (%2)", char(c), c); + assert(0 == "escape sequence unsupported"); + } } - else - mac->append(c); + if (c == EOF || c == '\n') + error("unterminated device control escape sequence"); } + if (mac.empty()) + return 0 /* nullptr */; + else + return new special_node(mac); } -static node *do_special() +#if 0 +static node *do_device_control() { int start_level = input_stack::get_level(); token start_token; @@ -5730,31 +5851,44 @@ } return new special_node(mac); } +#endif static void device_request() { - if (!has_arg()) { - warning(WARN_MISSING, "device request expects arguments"); - skip_line(); - return; + // We can't use `has_arg()` here because we want to read in copy mode. + int c; + node *n = 0 /* nullptr */; + for (;;) { + c = input_stack::peek(); + if (' ' == c) + (void) get_copy(0 /* nullptr */); + else + break; } - if (tok.is_newline() || tok.is_eof()) { - warning(WARN_MISSING, "device request expects arguments"); + if (('\n' == c) || (EOF == c)) { + warning(WARN_MISSING, "device control request expects arguments"); skip_line(); return; } - if ('"' == tok.ch()) { - tok.next(); - } macro mac; for (;;) { - if (tok.is_newline() || tok.is_eof()) + c = get_copy(&n); + if ('"' == c) { + c = get_copy(0 /* nullptr */); + break; + } + if (c != ' ' && c != '\t') break; - encode_char_for_troff_output(&mac, tok.ch()); - tok.next(); + } + while (c != '\n' && c != EOF) { + if (c != 0) + encode_char_for_troff_output(&mac, c); + else + mac.append(n); + c = get_copy(&n); } curenv->add_node(new special_node(mac)); - skip_line(); + tok.next(); } static void device_macro_request()
signature.asc
Description: PGP signature