Re: [ntfs-3g-devel] [PATCH] Internal UTF-8 converter (Embedded device makers may want it badly)

Bernhard Kaindl Fri, 07 Mar 2008 23:35:13 -0800

On Thu, 6 Mar 2008, Bernhard Kaindl wrote:
>
> This patch is a work-in-progress intended to remove this limitation,
> but its current capabilities are limited to using UTF-8 as a fallback.


I think that the converter functions themselves should be finished now,
only pending proper indentation.

I tested the converter functions with the ntfs charset conversion exerciser
which I wrote in order to find any bugs and found two remaining bugs that way.

You can run the exerciser yourself by configuring ntfs-3g, copying the file
into libntfs-3g/ and running, for example:

make && cc -g ntfs_testconv.c .libs/libntfs-3g.a && ./a.out

> - The ntfs_ucs_to_utf8() should be tested whether it really
>   honors out_len correctly now (implemented, but not yet
>   tested for that).

Done, there is only one difference left, but it's a weakness in the charset
converter which ntfs-3g inherited from libntfs. It shows when I compare
them when they are called with a pre-allocated output string:

libc-pre:                                         ('                   ')/ 0 = -1
libc-pre:                                         ('                   ')/ 1 = -1
libc-pre:                                         ('                   ')/ 2 = -1
libc-pre:                                         ('                   ')/ 3 = -1
libc-pre:                                         ('                   ')/ 4 = -1
libc-pre:                                         ('                   ')/ 5 = -1
libc-pre:                                         ('ä                 ')/ 6 = -1
libc-pre:                                         ('ä                 ')/ 7 = -1
libc-pre: c3 a4 62    20 20 20 20 20 20 20 20 20  ('äb')/ 8 =  3/3
utf8-pre:                                         ('                   ')/ 0 = -1
utf8-pre:                                         ('                   ')/ 1 = -1
utf8-pre:                                         ('                   ')/ 2 = -1
utf8-pre:                                         ('äb                ')/ 3 = -1
utf8-pre: c3 a4 62    20 20 20 20 20 20 20 20 20  ('äb')/ 4 =  3/3

Here, the current charset converter fails to convert a string which consists
of one ASCII character and one simple European character from Unicode to UTF-8.

Only when the output buffer size reaches 8 bytes does the conversion finally
succeed, but 4 bytes would have been enough to do the conversion, as my new
UTF-8 converter shows. That limitation likely comes from the old ntfsprogs
history.
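
To make the 4-byte figure concrete: 'ä' (U+00E4) takes two UTF-8 bytes, 'b'
takes one, and the terminating null takes one more. The following is only a
sketch of a bounded encoder for a pre-allocated buffer, not the code from the
patch. The name ucs2_to_utf8_bounded() is hypothetical, and the input is
assumed to be cpu-endian UCS-2 (no surrogate handling) so that the example
does not depend on le16_to_cpu().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): encode cpu-endian UCS-2
 * into a caller-supplied buffer of out_size bytes.
 * Returns the number of bytes written, excluding the terminating null,
 * or -1 if the output plus the terminating null does not fit.
 */
int ucs2_to_utf8_bounded(const uint16_t *in, size_t in_len,
		char *out, size_t out_size)
{
	size_t i, used = 0;

	for (i = 0; i < in_len && in[i]; i++) {
		uint16_t c = in[i];
		size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : 3;

		/* Always keep one byte in reserve for the terminating null. */
		if (used + need + 1 > out_size)
			return -1;
		if (need == 1) {
			out[used++] = (char)c;
		} else if (need == 2) {
			out[used++] = (char)(0xc0 | (c >> 6));
			out[used++] = (char)(0x80 | (c & 0x3f));
		} else {
			out[used++] = (char)(0xe0 | (c >> 12));
			out[used++] = (char)(0x80 | ((c >> 6) & 0x3f));
			out[used++] = (char)(0x80 | (c & 0x3f));
		}
	}
	/* Covers the empty-input case with a zero-sized buffer. */
	if (used + 1 > out_size)
		return -1;
	out[used] = '\0';
	return (int)used;
}
```

Called with the two code units 0x00e4, 0x0062, this sketch fails for a
3-byte buffer and writes c3 a4 62 plus the null into a 4-byte buffer,
which is the same boundary the utf8-pre lines above show.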

This calling mode is only in use at one or two places in NTFS-3g currently,
and I hit one of them possibly in my testing (I didn't debug it myself, Szaka
did), but the fix for it needs some further in-depth review and testing.

I did a quick test for the other place (ntfs_fuse_readlink), but as long as
fuse allocates a buffer of 4097 bytes (near MAX_PATH) for the link target, it
is not going to be a problem except for people with insanely long symlinks,
and one would just get an access error if one manages to mess it up between
systems with different MAX_PATH values.

> - I already allocate room for the worst case (6 bytes UTF-8 output
>   per UTF-16 character) it's just wasted space ATM, but to be
>   sure to not forget it, I already use 6 instead of 3 when
>   allocating the UTF-8 string.

Obsolete and undone: I replaced this with two functions which check
what size the output buffer has to have. That value is then checked
against the outs_len which was passed (if outs was passed as well);
otherwise a new buffer of exactly the needed size (plus one for the
trailing null character) is malloc'ed.
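
The size-first approach for the allocating case can be sketched as two
passes: count the bytes, then allocate exactly that many plus one and
encode. This is an illustration under assumptions, not the patch itself:
the names ucs2_utf8_len() and ucs2_to_utf8_dup() are made up, plain
malloc() stands in for ntfs_malloc(), and the input is taken as cpu-endian
UCS-2 without surrogate handling.

```c
#include <stdint.h>
#include <stdlib.h>

/* Pass 1: count the UTF-8 bytes needed, excluding the terminating null. */
static size_t ucs2_utf8_len(const uint16_t *in, size_t in_len)
{
	size_t i, n = 0;

	for (i = 0; i < in_len && in[i]; i++)
		n += in[i] < 0x80 ? 1 : in[i] < 0x800 ? 2 : 3;
	return n;
}

/*
 * Pass 2: allocate exactly the counted size plus one byte and encode.
 * Returns NULL if malloc() fails; the caller frees the result.
 */
char *ucs2_to_utf8_dup(const uint16_t *in, size_t in_len)
{
	size_t i;
	char *out, *t;

	out = malloc(ucs2_utf8_len(in, in_len) + 1);
	if (!out)
		return NULL;
	t = out;
	for (i = 0; i < in_len && in[i]; i++) {
		uint16_t c = in[i];

		if (c < 0x80) {
			*t++ = (char)c;
		} else if (c < 0x800) {
			*t++ = (char)(0xc0 | (c >> 6));
			*t++ = (char)(0x80 | (c & 0x3f));
		} else {
			*t++ = (char)(0xe0 | (c >> 12));
			*t++ = (char)(0x80 | ((c >> 6) & 0x3f));
			*t++ = (char)(0x80 | (c & 0x3f));
		}
	}
	*t = '\0';
	return out;
}
```

Because both passes walk the input with the same per-character rule, the
second pass cannot overrun the buffer, so no worst-case factor (3 or 6
bytes per character) is needed when allocating.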

Here comes the new patch, cleaned up; it only needs proper indentation
in some functions and likely a configure check for nl_langinfo working,
as well as the TODO moved to a separate file.
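
For reference, the codeset test that check_codeset() in the patch performs
can be isolated as below. This is a sketch only: both function names are
hypothetical, and the setlocale(LC_CTYPE, "") call is an assumption about
what the caller has set up. The "ANSI" prefix matches what glibc reports
for the plain C/POSIX locale ("ANSI_X3.4-1968").

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Same test as check_codeset() in the patch: fall back to the internal
 * UTF-8 converter if there is no codeset or it starts with "ANSI".
 */
int codeset_needs_utf8_fallback(const char *codeset)
{
	return !codeset || !strncmp(codeset, "ANSI", 4);
}

/* Hypothetical wrapper: query the codeset of the environment's locale. */
int locale_needs_utf8_fallback(void)
{
	setlocale(LC_CTYPE, "");
	return codeset_needs_utf8_fallback(nl_langinfo(CODESET));
}
```

A configure check would mainly need to confirm that <langinfo.h> exists
and that a call to nl_langinfo(CODESET) compiles and links.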

For further notes, see the mail to which this one answers.

Bernhard

--- /include/ntfs-3g/unistr.h
+++ /include/ntfs-3g/unistr.h
@@ -26,6 +26,8 @@
 #include "types.h"
 #include "layout.h"

+extern int use_utf8;
+
 extern BOOL ntfs_names_are_equal(const ntfschar *s1, size_t s1_len,
                const ntfschar *s2, size_t s2_len, const IGNORE_CASE_BOOL ic,
                const ntfschar *upcase, const u32 upcase_size);
--- /libntfs-3g/unistr.c
+++ /libntfs-3g/unistr.c
@@ -47,6 +47,8 @@
 #include "logging.h"
 #include "misc.h"

+int use_utf8;
+
 /*
  * IMPORTANT
  * =========
@@ -378,6 +380,85 @@ int ntfs_file_values_compare(const FILE_
                        err_val, ic, upcase, upcase_len);
 }

+/* Return the number of bytes needed (without the terminating null)
+ * to store the given UCS-2LE string as UTF-8, or -1 if it does
+ * not fit into outs_len bytes.
+ * TODO: Extend this with a function to support UTF-16LE.
+*/
+static int ucs2_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_len)
+{
+       int i;
+       int count = 0;
+
+       for (i = 0; i < ins_len && ins[i]; i++) {
+               unsigned short c = le16_to_cpu(ins[i]);
+               if (c < 0x80)
+                       count++;
+               else
+                       count += (c & 0xf800) ? 3 : 2;
+               if (count > outs_len)
+                       goto fail;
+       }
+       return count;
+fail:
+       return -1;
+}
+
+/*
+ * ntfs_ucs_to_utf8 - convert a little endian Unicode string to an UTF-8 string
+ * @ins:       input Unicode string buffer
+ * @ins_len:   length of input string in Unicode characters
+ * @outs:      on return contains the (allocated) output multibyte string
+ * @outs_len:  length of output buffer in bytes
+ * TODO: Replace this with a function which converts from UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
+ */
+int ntfs_ucs_to_utf8(const ntfschar *ins, const int ins_len, char **outs, int outs_len)
+{
+       char *t, *end;
+       int i, size;
+
+       if (!*outs)
+               outs_len = PATH_MAX;
+
+       size = ucs2_to_utf8_size(ins, ins_len, outs_len);
+
+       if (size < 0) {
+               errno = ENAMETOOLONG;
+               goto fail;
+       }
+       if (!*outs)
+               *outs = ntfs_malloc((outs_len = size + 1));
+
+       t = *outs;
+       end = t + outs_len;
+
+       for (i = 0; i < ins_len && ins[i]; i++) {
+           unsigned short c = le16_to_cpu(ins[i]);
+           if (c < 0x80) {
+               *t++ = c;
+               if (t == end)
+                       goto fail;
+           } else {
+              if (c & 0xf800) {
+                  if (t+3 >= end)
+                       goto fail;
+                  *t++ = 0xe0 | (c >> 12);
+                  *t++ = 0x80 | ((c >> 6) & 0x3f);
+              } else {
+                  if (t+2 >= end)
+                       goto fail;
+                  *t++ = (0xc0 | ((c >> 6) & 0x3f));
+              }
+              *t++ = 0x80 | (c & 0x3f);
+           }
+       }
+       *t = '\0';
+       return t - *outs;
+fail:
+       return -1;
+}
+
 /**
  * ntfs_ucstombs - convert a little endian Unicode string to a multibyte string
  * @ins:       input Unicode string buffer
@@ -402,6 +483,8 @@ int ntfs_file_values_compare(const FILE_
  *                     sequence according to the current locale.
  *     ENAMETOOLONG    Destination buffer is too small for input string.
  *     ENOMEM          Not enough memory to allocate destination buffer.
+ * TODO: Replace this with a function which converts from UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
  */
 int ntfs_ucstombs(const ntfschar *ins, const int ins_len, char **outs,
                int outs_len)
@@ -424,12 +507,15 @@ int ntfs_ucstombs(const ntfschar *ins, c
                errno = ENAMETOOLONG;
                return -1;
        }
+       if (use_utf8)
+               return ntfs_ucs_to_utf8(ins, ins_len, outs, outs_len);
        if (!mbs) {
                mbs_len = (ins_len + 1) * MB_CUR_MAX;
                mbs = ntfs_malloc(mbs_len);
                if (!mbs)
                        return -1;
        }
+
 #ifdef HAVE_MBSINIT
        memset(&mbstate, 0, sizeof(mbstate));
 #else
@@ -492,6 +578,107 @@ err_out:
        return -1;
 }

+/* Return the number of 16-bit elements in UCS-2LE needed (without
+ * the terminating null) to store the given UTF-8 string, or -1 if it
+ * does not fit into PATH_MAX.
+ * TODO: Extend this with a function to support UTF-16LE.
+*/
+static int utf8_to_ucs2_size(const char *s)
+{
+    unsigned int byte;
+    size_t count = 0;
+
+    while ((byte = *((unsigned char *)s++))) {
+           if (++count >= PATH_MAX || byte >= 0xF0)
+               goto fail;
+           if (!*s) break;
+           if (byte >= 0xC0) s++;
+           if (!*s) break;
+           if (byte >= 0xE0) s++;
+    }
+    return count;
+fail:
+    return -1;
+}
+/* This converts one UTF-8 sequence to cpu-endian UCS-2
+ * TODO: Replace this with a function which converts to UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
+*/
+static int utf8toucs2(wchar_t *wc, const char *s)
+{
+    unsigned int byte = *((unsigned char *)s);
+
+    if (byte == 0) {
+        *wc = (wchar_t) 0;
+        return 0;
+    } else if (byte < 0xC0) {
+        *wc = (wchar_t) byte;
+        return 1;
+    } else if (byte < 0xE0) {
+       if(strlen(s) < 2)
+               goto fail;
+        if ((s[1] & 0xC0) == 0x80) {
+            *wc = (wchar_t) (((byte & 0x1F) << 6) | (s[1] & 0x3F));
+            return 2;
+        } else
+               goto fail;
+    } else if (byte < 0xF0) {
+       if(strlen(s) < 3)
+               goto fail;
+        if (((s[1] & 0xC0) == 0x80) && ((s[2] & 0xC0) == 0x80)) {
+            *wc = (wchar_t) (((byte & 0x0F) << 12)
+                    | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
+           /* Surrogates range */
+           if((*wc >= 0xD800 && *wc <= 0xDFFF) ||
+              (*wc == 0xFFFE || *wc == 0xFFFF))
+                       goto fail;
+            return 3;
+        }
+    }
+fail:
+    return -1;
+}
+
+/**
+ * ntfs_utf8_to_ucs - convert a UTF-8 string to a UCS-2LE Unicode string
+ * @ins:       input multibyte string buffer
+ * @outs:      on return contains the (allocated) output Unicode string
+ * @outs_len:  length of output buffer in Unicode characters
+ * TODO: Replace this with a function which converts to UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
+ */
+int ntfs_utf8_to_ucs(const char *ins, ntfschar **outs, int outs_len)
+{
+       const char *t = ins;
+       wchar_t wc;
+       ntfschar *outpos;
+       int shorts = utf8_to_ucs2_size(ins);
+
+       if (shorts < 0 || (*outs && outs_len && shorts > outs_len)) {
+               errno = ENAMETOOLONG;
+               goto fail;
+       }
+       if (!*outs)
+               *outs = ntfs_malloc((shorts+1) * sizeof(ntfschar));
+
+       outpos = *outs;
+
+       while(1) {
+               int m  = utf8toucs2(&wc, t);
+               if (m < 0) {
+                       errno = EILSEQ;
+                       goto fail;
+               }
+               *outpos++ = cpu_to_le16(wc);
+               if (m == 0)
+                       break;
+               t += m;
+       }
+    return --outpos - *outs;
+fail:
+    return -1;
+}
+
 /**
  * ntfs_mbstoucs - convert a multibyte string to a little endian Unicode string
  * @ins:       input multibyte string buffer
@@ -515,6 +702,8 @@ err_out:
  *                     string according to the current locale.
  *     ENAMETOOLONG    Destination buffer is too small for input string.
  *     ENOMEM          Not enough memory to allocate destination buffer.
+ * TODO: Replace this with a function which converts to UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
  */
 int ntfs_mbstoucs(const char *ins, ntfschar **outs, int outs_len)
 {
@@ -536,6 +725,8 @@ int ntfs_mbstoucs(const char *ins, ntfsc
                errno = ENAMETOOLONG;
                return -1;
        }
+       if (use_utf8)
+               return ntfs_utf8_to_ucs(ins, outs, outs_len);
        /* Determine the size of the multi-byte string in bytes. */
        ins_size = strlen(ins);
        /* Determine the length of the multi-byte string. */
diff -rup /src/ntfs-3g.c /src/ntfs-3g.c
--- /src/ntfs-3g.c
+++ /src/ntfs-3g.c
@@ -70,6 +70,7 @@
 #include <getopt.h>
 #include <syslog.h>
 #include <sys/wait.h>
+#include <langinfo.h>

 #ifdef HAVE_SETXATTR
 #include <sys/xattr.h>
@@ -2564,6 +2565,15 @@ static void setup_logging(char *parsed_o
        ntfs_log_info("Mount options: %s\n", parsed_options);
 }

+void check_codeset() {
+       char *codeset = nl_langinfo(CODESET);
+       if (!codeset || !strncmp(codeset, "ANSI", 4)) {
+               ntfs_log_info("Locale invalid or has ANSI codeset: "
+                               "Using UTF-8 for international characters.\n");
+               use_utf8 = 1;
+       }
+}
+
 int main(int argc, char *argv[])
 {
        char *parsed_options = NULL;
@@ -2600,6 +2624,8 @@ int main(int argc, char *argv[])
                err = NTFS_VOLUME_SYNTAX_ERROR;
                goto err_out;
        }
+
+       check_codeset();

 #if defined(linux) || defined(__uClinux__)
        fstype = get_fuse_fstype();

/**
 * ntfs_testconv.c - Exerciser for testing the libntfs charset converter
 *
 * Copyright (c) 2008 Bernhard Kaindl - bk at fsfe dot org
 *
 * This program file is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as published
 * by the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program/include file is distributed in the hope that it will be
 * useful, but WITHOUT ANY WARRANTY; without even the implied warranty
 * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program (in the main directory of the NTFS-3G
 * distribution in the file COPYING); if not, write to the Free Software
 * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */
typedef unsigned short ntfschar ;
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

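/**
 * At the moment, this is set up to compare the libntfs-internal
 * UTF-8<->UCS-2LE converter by Bernhard Kaindl against the
 * libc-based charset converter for UCS-2LE in UTF-8 mode (with UTF-8 locale)
 */
extern int use_utf8;
/* Prototypes of the libntfs-3g converters under test (see unistr.c) */
extern int ntfs_mbstoucs(const char *ins, ntfschar **outs, int outs_len);
extern int ntfs_ucstombs(const ntfschar *ins, const int ins_len, char **outs,
		int outs_len);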

ntfschar *ntfs_convtest(const char *s, int len)
{
	int i,c,t, max_test_len = 6;
        ntfschar *ucs;
	char *str;

	for (t = 0; t < 2; t++) {
		for (use_utf8 = 0; use_utf8 < 2; use_utf8++) {
			for  (i=0; i<=max_test_len;i++) {
				printf("%s-%s: ", use_utf8 ? "utf8":"libc", t?"pre":"all");
				if (t) {
					ucs = malloc(9*2);
					memset(ucs,0x40,9*2);
				} else
					ucs = 0;
				len = ntfs_mbstoucs(s, &ucs, i);
				for (c=0; c<=7; c++)
					if (len > 0 && ucs[c])
						printf("%04x ", ucs[c]);
					else
						printf("     ");
				printf("(limit:%d, len=%d)\n",i , len);
				if (t && use_utf8 && // Ok, we are in the 4th quarter...
					(i == max_test_len || i && len > 0)) { // Are we done with it?
					if (len > 0) // Ok, first half of this function is done,
						goto next; // we are ready to go to second test loop
					else
						printf("Error, test with max_test_len failed!\n");
				}
				if (i && len > 0) 
                                        break; // Test succeeded with non-zero length, no need to test longer i's.
				// Prune allocated memory:
				if (len > 0)
					memset(ucs,0x50,len*2);
				if (t > 0)
					memset(ucs,0x50,9*2);
				// And free it if it's allocated:
				if (t || len > 0)
					free(ucs);
			}
		}
	}

next:
	for (t = 0; t < 2; t++) {
		for (use_utf8 = 0; use_utf8 < 2; use_utf8++) {
			for  (i=0; i<12;i++) {
				printf("%s-%s: ", use_utf8 ? "utf8":"libc", t?"pre":"all");
				if (t) {
					str = malloc(19);
					memset(str,0x20,19);
				} else
					str = 0;
				int ret=ntfs_ucstombs(ucs, len, &str, i);
				for (c=0; c<=12; c++)
					if (ret > 0 && str[c])
						printf("%02x ", str[c] & 0xff);
					else 
						printf("   ");
				printf(" ('%s')/%2d = %2d", 
					str, i,	ret );
				if (ret > 0)
					printf("/%d", (int)strlen(str));
				puts("");
				if (t) {
					memset(str,0x20,19);
					free(str);
				}
				if (i && ret > 0)
					break;
			}
		}
	}
}

int main(void) {
	setlocale(LC_ALL, "en_US.UTF-8");

	ntfs_convtest("A", 4);
	ntfs_convtest("äb", 4);
	ntfs_convtest("å¾½ÐÐ±", 16);
}
