Re: [Patch] Support UTF-8 scripts

2005-08-14 Thread Martin v. Löwis
Lee Revell wrote:
> For strings, of course.  But there's no need for UTF-8 operators.

Indeed - this is the main rationale for the patch, of course. People
want to write non-ASCII text in scripts, primarily in string literals
and (perhaps even more often) in comments. For comments, it would
not really matter whether the interpreter knows the encoding - but
the editor has to know, and the UTF-8 signature primarily helps the
editor (*).

Then we are back to the rationale for this patch: if you need the
UTF-8 signature to reliably identify the script as being UTF-8
encoded, you then currently cannot easily run it as a script through
binfmt_script, as that code requires a script to start with #!.

Regards,
Martin

(*) As I said before: at least for Python, the UTF-8 signature also
has syntactic meaning. It is allowed at the beginning of a file
as an addition to the language syntax, and it tells the interpreter
that Unicode literals (usually represented internally as UCS-2)
are represented as UTF-8 in the source code.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch] Support UTF-8 scripts

2005-08-13 Thread Martin v. Löwis
This patch adds support for UTF-8 signatures (aka BOM, byte order
mark) to binfmt_script. Files that start with EF BB BF # ! are now
recognized as scripts (in addition to files starting with # !).
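
The check described above can be sketched in user space roughly as
follows. This is only an illustration of the byte comparison, not the
kernel code: the helper name script_prefix_len and its return
convention are assumptions made for this example, and the actual
change to load_script() is the diff further down.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: returns the number of
 * bytes to skip before the interpreter path (2 for "#!", 5 for the
 * UTF-8 signature EF BB BF followed by "#!"), or 0 if the buffer
 * does not look like a script.
 */
static size_t script_prefix_len(const unsigned char *buf, size_t len)
{
	static const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };

	if (len >= 2 && buf[0] == '#' && buf[1] == '!')
		return 2;
	if (len >= 5 && memcmp(buf, bom, sizeof(bom)) == 0 &&
	    buf[3] == '#' && buf[4] == '!')
		return 5;
	return 0;
}
```

Using unsigned char here sidesteps the question of whether plain char
is signed when comparing against bytes above 0x7F; the patch itself
compares char values against '\xef' and friends, which has the same
type on both sides.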

With such support, creating scripts that reliably carry non-ASCII
characters is simplified. Editors and the script interpreter can
easily agree on what the encoding of the script is, and the
interpreter can then render strings appropriately. Currently,
Python supports source files that start with the UTF-8 signature;
the approach would naturally extend to Perl to enhance/replace
the "use utf8" pragma. Likewise, Tcl could use the UTF-8 signature
to reliably identify UTF-8 source code (instead of assuming
[encoding system] for source code).

Please find the patch attached below.

Regards,
Martin

Signed-off-by: Martin v. Löwis <[EMAIL PROTECTED]>

diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -1,7 +1,7 @@
 /*
  *  linux/fs/binfmt_script.c
  *
- *  Copyright (C) 1996  Martin von Löwis
+ *  Copyright (C) 1996, 2005  Martin von Löwis
  *  original #!-checking implemented by tytso.
  */

@@ -23,7 +23,16 @@ static int load_script(struct linux_binp
 	char interp[BINPRM_BUF_SIZE];
 	int retval;
 
-	if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!') || (bprm->sh_bang))
+	/* It is a recursive invocation. */
+	if (bprm->sh_bang)
+		return -ENOEXEC;
+
+	/* It starts neither with #!, nor with #! preceded by
+	   the UTF-8 signature. */
+	if (!(((bprm->buf[0] == '#') && (bprm->buf[1] == '!'))
+	      || ((bprm->buf[0] == '\xef') && (bprm->buf[1] == '\xbb')
+	          && (bprm->buf[2] == '\xbf') && (bprm->buf[3] == '#')
+	          && (bprm->buf[4] == '!'))))
 		return -ENOEXEC;
 	/*
 	 * This section does the #! interpretation.
@@ -46,7 +55,8 @@ static int load_script(struct linux_binp
 		else
 			break;
 	}
-	for (cp = bprm->buf+2; (*cp == ' ') || (*cp == '\t'); cp++);
+	cp = (bprm->buf[0]=='\xef') ? bprm->buf+5 : bprm->buf+2;
+	while ((*cp == ' ') || (*cp == '\t')) cp++;
 	if (*cp == '\0')
 		return -ENOEXEC; /* No interpreter name found */
 	i_name = cp;

