Re: [Patch] Support UTF-8 scripts
Lee Revell wrote:
> For strings, of course. But there's no need for UTF-8 operators.

Indeed - this is the main rationale for the patch, of course. People want to write non-ASCII text in scripts primarily in string literals, and (perhaps even more often) in comments.

Now, for comments, it wouldn't really matter whether the interpreter knows what the encoding is - but the editor would have to know, and the UTF-8 signature primarily helps the editor (*).

Then we are back to the rationale for this patch: if you need the UTF-8 signature to reliably identify the script as being UTF-8 encoded, you currently cannot easily run it as a script through binfmt_script, as that code requires a script to start with #!.

Regards,
Martin

(*) As I said before: at least for Python, the UTF-8 signature also has syntactic meaning. It is allowed at the beginning of a file as an addition to the language syntax, and it tells the interpreter that Unicode literals (usually represented internally as UCS-2) are represented as UTF-8 in the source code.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Patch] Support UTF-8 scripts
This patch adds support for UTF-8 signatures (aka BOM, byte order mark) to binfmt_script. Files that start with EF BB BF # ! are now recognized as scripts (in addition to files starting with # !).

With such support, creating scripts that reliably carry non-ASCII characters is simplified. Editors and the script interpreter can easily agree on what the encoding of the script is, and the interpreter can then render strings appropriately.

Currently, Python supports source files that start with the UTF-8 signature; the approach would naturally extend to Perl to enhance/replace the "use utf8" pragma. Likewise, Tcl could use the UTF-8 signature to reliably identify UTF-8 source code (instead of assuming [encoding system] for source code).

Please find the patch attached below.

Regards,
Martin

Signed-off-by: Martin v. Löwis <[EMAIL PROTECTED]>

diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -1,7 +1,7 @@
 /*
  *  linux/fs/binfmt_script.c
  *
- *  Copyright (C) 1996  Martin von Löwis
+ *  Copyright (C) 1996, 2005  Martin von Löwis
  *  original #!-checking implemented by tytso.
  */
 
@@ -23,7 +23,16 @@ static int load_script(struct linux_binp
 	char interp[BINPRM_BUF_SIZE];
 	int retval;
 
-	if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!') || (bprm->sh_bang))
+	/* It is a recursive invocation. */
+	if (bprm->sh_bang)
+		return -ENOEXEC;
+
+	/* It starts neither with #!, nor with #! preceded by
+	   the UTF-8 signature. */
+	if (!(((bprm->buf[0] == '#') && (bprm->buf[1] == '!'))
+	      || ((bprm->buf[0] == '\xef') && (bprm->buf[1] == '\xbb')
+	          && (bprm->buf[2] == '\xbf') && (bprm->buf[3] == '#')
+	          && (bprm->buf[4] == '!'))))
 		return -ENOEXEC;
 	/*
 	 * This section does the #! interpretation.
@@ -46,7 +55,8 @@ static int load_script(struct linux_binp
 		else
 			break;
 	}
-	for (cp = bprm->buf+2; (*cp == ' ') || (*cp == '\t'); cp++);
+	cp = (bprm->buf[0]=='\xef') ? bprm->buf+5 : bprm->buf+2;
+	while ((*cp == ' ') || (*cp == '\t')) cp++;
 	if (*cp == '\0')
 		return -ENOEXEC; /* No interpreter name found */
 	i_name = cp;
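For readers who want to try the prefix check outside the kernel, here is a minimal userspace sketch of the same logic. It is not part of the patch: the helper names script_prefix_len and interp_name are hypothetical, and the explicit length argument is an assumption added so the sketch never reads past the buffer (the kernel code instead relies on the fixed-size, NUL-terminated bprm->buf).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): return the number of bytes
 * occupied by the script prefix, i.e. 2 for "#!" and 5 for the UTF-8
 * signature EF BB BF followed by "#!". Return -1 if the buffer does
 * not start like a script.
 */
static int script_prefix_len(const char *buf, size_t len)
{
    static const char bom[] = "\xef\xbb\xbf";
    size_t off = 0;

    /* Skip an optional UTF-8 signature. */
    if (len >= 3 && memcmp(buf, bom, 3) == 0)
        off = 3;

    /* "#!" must follow immediately. */
    if (len < off + 2 || buf[off] != '#' || buf[off + 1] != '!')
        return -1;

    return (int)(off + 2);
}

/*
 * Hypothetical helper: skip blanks after the prefix and return a pointer
 * to the interpreter name, or NULL if there is none. This mirrors the
 * "cp = ...; while (...) cp++;" step of the patch.
 */
static const char *interp_name(const char *buf, size_t len)
{
    int off = script_prefix_len(buf, len);
    const char *cp, *end;

    if (off < 0)
        return NULL;

    cp = buf + off;
    end = buf + len;
    while (cp < end && (*cp == ' ' || *cp == '\t'))
        cp++;

    if (cp == end || *cp == '\0' || *cp == '\n')
        return NULL; /* No interpreter name found */

    return cp;
}
```

Using memcmp on the three signature bytes sidesteps any question about whether plain char is signed on a given platform; the patch's direct comparisons against '\xef' and friends compare char values with char constants, so they behave consistently as well.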