[PATCH branch-1.6] input: Revert "Reorder token recognition to match other implementations."

Eric Blake via M4-patches Wed, 21 May 2025 19:50:17 -0700

This reverts commit 16e712b9dbc.  It turns out that giving comments
higher priority than macros makes it possible to parse files that use
either REM or REMARK as a comment marker, and to read a file that
contains mismatched () as a single comment, which can then be passed
to translit to turn it into something safe for further m4 handling.
POSIX may leave this behavior unspecified, but I actually hit a case
in the Advent of Code challenges (adventofcode.com) where abusing GNU
m4 comment semantics let me solve a problem in m4, so I see no reason
to break it after all.

* NEWS: Remove mention of reverted change.
* doc/m4.texi (Changecom): Update text to describe older rules, but
with more tests.
* src/input.c (next_token): Parse comments with highest priority.
---
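A note for reviewers, not part of the commit: the effect of the reordering in next_token can be illustrated with a toy classifier. This is only a sketch under stated assumptions. The names toy_token, starts_comment, starts_word, and classify are hypothetical and do not exist in src/input.c, and the real code also handles quotes, end delimiters, and buffered input, which are ignored here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not the enum in src/input.c.  */
enum toy_token { TOY_COMMENT, TOY_WORD, TOY_OTHER };

/* True if INPUT begins with the comment-start string START.  */
static bool
starts_comment (const char *input, const char *start)
{
  return strncmp (input, start, strlen (start)) == 0;
}

/* True if C can begin a macro name (a letter or underscore).  */
static bool
starts_word (char c)
{
  return isalpha ((unsigned char) c) || c == '_';
}

/* Classify the first token of INPUT.  When COMMENTS_FIRST is true, the
   comment check runs before the macro-name check (the order this patch
   restores); when false, macro names win (the order being reverted).  */
static enum toy_token
classify (const char *input, const char *start, bool comments_first)
{
  assert (start[0] != '\0');
  if (comments_first)
    {
      if (starts_comment (input, start))
        return TOY_COMMENT;
      if (starts_word (input[0]))
        return TOY_WORD;
    }
  else
    {
      if (starts_word (input[0]))
        return TOY_WORD;
      if (starts_comment (input, start))
        return TOY_COMMENT;
    }
  return TOY_OTHER;
}
```

With a start string of "REM", classify ("REMARK hi", "REM", true) reports a comment, which corresponds to the new doc test where REMARK hi is echoed unchanged, while passing false reports a word, so the comment would never be seen. A line such as "hi REM" is a word in both orders, since the delimiter is only checked at the start of a token.
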
 NEWS        |   4 --
 doc/m4.texi |  45 ++++++++++++--------
 src/input.c | 116 ++++++++++++++++++++++++++--------------------------
 3 files changed, 86 insertions(+), 79 deletions(-)

diff --git a/NEWS b/NEWS
index 97c4d9c0..54802fdb 100644
--- a/NEWS
+++ b/NEWS
@@ -46,10 +46,6 @@ GNU M4 NEWS - User visible changes.
    then apply this patch:
      http://git.sv.gnu.org/gitweb/?p=autoconf.git;a=commitdiff;h=56d42fa71

-** The `changecom' builtin semantics now match traditional
-   implementations; if the start-comment string resembles a macro name or
-   the start-quote string, comments are effectively disabled.
-
 ** The `divert' builtin now accepts an optional second argument of text
    that is immediately placed in the new diversion, regardless of whether
    the current expansion is nested within argument collection of another
diff --git a/doc/m4.texi b/doc/m4.texi
index d8befa52..3a5e55ad 100644
--- a/doc/m4.texi
+++ b/doc/m4.texi
@@ -5305,12 +5305,21 @@ Changecom
 of any length.  Other implementations cap the delimiter length to five
 characters, but GNU has no inherent limit.

-As of M4 1.6, macros and quotes are recognized in preference to
-comments, so if a prefix of @var{start} can be recognized as part of a
-potential macro name, or confused with a quoted string, the comment
-mechanism is effectively disabled (earlier versions of GNU M4
-favored comments, but this was inconsistent with other implementations).
-This means
-that @var{start} should not begin with a letter, digit, or @samp{_}
-(underscore), and that neither the start-quote nor the start-comment
-string should be a prefix of the other.
+POSIX states that behavior is unspecified if either @var{start} or
+@var{end} contains letters, numbers, underscore, or left parenthesis;
+this is because other implementations do not agree on whether a comment
+or a macro name should take precedence.  Likewise, neither the
+start-quote nor the start-comment string should be a prefix of the
+other.  However, as an extension, GNU M4 allows any of those characters
+in the @code{changecom} arguments, and recognizes comments with a higher
+precedence than macros or quoted strings.  One use of this non-portable
+extension is to temporarily change comments to a start sequence known to
+appear early in a file, and an end sequence that does not appear in the
+file, in order to then use @code{include} on a file that may contain
+unbalanced parentheses or unquoted macros or commas, thereby treating
+that entire file as a single comment that can then be passed as a single
+argument to @code{translit} or @code{patsubst} to convert the comment
+into something safe to process.  Note that a comment delimiter is not
+recognized in the middle of a potential macro name or quoted string, but
+that a comment delimiter that resembles a macro name is not avoided
+merely by appending more macro characters.

 @example
 define(`hi', `HI')
@@ -5323,33 +5332,35 @@ Changecom
 changecom(`q', `Q')
 @result{}
 q hi Q hi
-@result{}q HI Q HI
+@result{}q hi Q HI
 changecom(`1', `2')
 @result{}
 hi1hi2
 @result{}hello
 hi 1hi2
 @result{}HI 1hi2
-changecom(`[[', `]]')
+changecom(`REM')
 @result{}
-changequote(`[[[', `]]]')
+REM hi
+@result{}REM hi
+REMARK hi
+@result{}REMARK hi
+changecom(`[[', `]]')changequote(`[[[', `]]]')
 @result{}
 [hi]
 @result{}[HI]
 [[hi]]
 @result{}[[hi]]
 [[[hi]]]
-@result{}hi
+@result{}[[[hi]]]
 changequote
 @result{}
-changecom(`[[[', `]]]')
-@result{}
-changequote(`[[', `]]')
+changecom(`[[[', `]]]')changequote(`[[', `]]')
 @result{}
 [[hi]]
 @result{}hi
 [[[hi]]]
-@result{}[hi]
+@result{}[[[hi]]]
 @end example

 Comments are recognized in preference to argument collection.  In
diff --git a/src/input.c b/src/input.c
index 786a5b39..0dceb7b4 100644
--- a/src/input.c
+++ b/src/input.c
@@ -1884,7 +1884,64 @@ next_token (token_data *td, int *line, struct obstack *obs, bool allow_argv,
       return TOKEN_ARGV;
     }

-  if (c_isalpha (ch) || ch == '_')
+  if (MATCH (ch, curr_comm.str1, curr_comm.len1, true))
+    {
+      if (obs)
+        obs_td = obs;
+      obstack_grow (obs_td, curr_comm.str1, curr_comm.len1);
+      while (1)
+        {
+          /* Start with buffer search for potential end delimiter.  */
+          size_t len;
+          const char *buffer = next_buffer (&len, false);
+          if (buffer)
+            {
+              const char *p = (char *) memchr (buffer, *curr_comm.str2, len);
+              if (p)
+                {
+                  obstack_grow (obs_td, buffer, p - buffer);
+                  ch = to_uchar (*p);
+                  consume_buffer (p - buffer + 1);
+                }
+              else
+                {
+                  obstack_grow (obs_td, buffer, len);
+                  consume_buffer (len);
+                  continue;
+                }
+            }
+
+          /* Fall back to byte-wise search.  */
+          else
+            ch = next_char (false, false);
+          if (ch == CHAR_EOF)
+            {
+              /* Current_file changed to "" if we see CHAR_EOF, use
+                 the previous value we stored earlier.  */
+              if (!caller)
+                {
+                  assert (line);
+                  current_line = *line;
+                  current_file = file;
+                }
+              m4_error (EXIT_FAILURE, 0, caller, _("end of file in comment"));
+            }
+          if (ch == CHAR_MACRO)
+            {
+              init_macro_token (obs, obs ? td : NULL);
+              continue;
+            }
+          if (MATCH (ch, curr_comm.str2, curr_comm.len2, true))
+            {
+              obstack_grow (obs_td, curr_comm.str2, curr_comm.len2);
+              break;
+            }
+          assert (ch < CHAR_EOF);
+          obstack_1grow (obs_td, ch);
+        }
+      type = TOKEN_COMMENT;
+    }
+  else if (c_isalpha (ch) || ch == '_')
     {
       obstack_1grow (&token_stack, ch);
       while (1)
@@ -2002,63 +2059,6 @@ next_token (token_data *td, int *line, struct obstack *obs, bool allow_argv,
             }
         }
     }
-  else if (MATCH (ch, curr_comm.str1, curr_comm.len1, true))
-    {
-      if (obs)
-        obs_td = obs;
-      obstack_grow (obs_td, curr_comm.str1, curr_comm.len1);
-      while (1)
-        {
-          /* Start with buffer search for potential end delimiter.  */
-          size_t len;
-          const char *buffer = next_buffer (&len, false);
-          if (buffer)
-            {
-              const char *p = (char *) memchr (buffer, *curr_comm.str2, len);
-              if (p)
-                {
-                  obstack_grow (obs_td, buffer, p - buffer);
-                  ch = to_uchar (*p);
-                  consume_buffer (p - buffer + 1);
-                }
-              else
-                {
-                  obstack_grow (obs_td, buffer, len);
-                  consume_buffer (len);
-                  continue;
-                }
-            }
-
-          /* Fall back to byte-wise search.  */
-          else
-            ch = next_char (false, false);
-          if (ch == CHAR_EOF)
-            {
-              /* Current_file changed to "" if we see CHAR_EOF, use
-                 the previous value we stored earlier.  */
-              if (!caller)
-                {
-                  assert (line);
-                  current_line = *line;
-                  current_file = file;
-                }
-              m4_error (EXIT_FAILURE, 0, caller, _("end of file in comment"));
-            }
-          if (ch == CHAR_MACRO)
-            {
-              init_macro_token (obs, obs ? td : NULL);
-              continue;
-            }
-          if (MATCH (ch, curr_comm.str2, curr_comm.len2, true))
-            {
-              obstack_grow (obs_td, curr_comm.str2, curr_comm.len2);
-              break;
-            }
-          assert (ch < CHAR_EOF);
-          obstack_1grow (obs_td, ch);
-        }
-      type = TOKEN_COMMENT;
-    }
   else
     {
       assert (ch < CHAR_EOF);
-- 
2.49.0

