[ntfs-3g-devel] [PATCH] Internal UTF-8 converter (Embedded device makers may want it badly)

Bernhard Kaindl Thu, 06 Mar 2008 12:17:21 -0800

Hello List
(and embedded device makers without a locale charset converter),

I am not submitting this as a finished patch but as information
for the list, because this patch needs some further enhancement
before it could be merged. The code may already be useful for
some, though, especially embedded device makers who want
support for international characters in NTFS-3G without needing
the locale functions from glibc to work and the corresponding
locales to be installed on the device.


The issue which this patch addresses is that NTFS-3G fully depends
on a working charset converter provided by the C library: any path
name element which contains international characters has to be
converted from UTF-16LE to the user's locale before it can be
presented.

If there is a problem in setting that locale up, for example
some misconfiguration, paths which contain international
characters currently can't be shown and can't be accepted by NTFS-3G,
as it has no way to convert the file or directory name to/from UTF-16LE.

This patch is a work in progress intended to remove this limitation,
but its current capabilities are limited to using UTF-8 as a fallback.

However, as NTFS internally always uses Unicode to store characters,
an encoding like UTF-8 (and I don't see any competitor around) has
to be used by the user anyway to be able to receive all paths which
can be stored on an NTFS volume. So as far as I am aware, it's
the only encoding really needed, and it's widely used.

This patch adds an NTFS-3G-internal UTF-8 converter which is
(in its current form) limited to conversion to/from UCS-2LE,
which is what NTFS-3G currently uses instead of UTF-16LE. This
apparently wasn't a major problem so far, as characters
which do not fit into UCS-2 are apparently not used in practice.
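
As an illustration of the bit layout involved (a minimal sketch, not
the code from the patch), one UCS-2 code unit maps to one, two or
three UTF-8 bytes. The helper name ucs2_unit_to_utf8() is hypothetical,
and the input is assumed to be in CPU byte order already:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one UCS-2 code unit (CPU byte order)
 * as UTF-8. Writes 1 to 3 bytes to out and returns the byte count. */
static size_t ucs2_unit_to_utf8(uint16_t c, unsigned char out[3])
{
	assert(out != NULL);
	if (c < 0x80) {			/* 0xxxxxxx */
		out[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {		/* 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	/* 1110xxxx 10xxxxxx 10xxxxxx */
	out[0] = (unsigned char)(0xE0 | (c >> 12));
	out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	out[2] = (unsigned char)(0x80 | (c & 0x3F));
	return 3;
}
```

For example, U+00E4 becomes the two bytes C3 A4 and U+20AC becomes
the three bytes E2 82 AC.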

This internal UTF-8 converter can be enabled by flipping
a global flag. This patch checks whether the locale used by
NTFS-3G uses a 7-bit ASCII charset or whether there was an
error during locale setup; if either is true, it kicks in
the internal UTF-8 converter.

That part uses nl_langinfo, which isn't necessarily available
on all platforms, so this way of enabling it would likely require
a configure check and/or option.
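
The detection described above could be sketched roughly as follows.
This is only an illustration under stated assumptions: the function
names codeset_is_ascii() and want_internal_utf8() are hypothetical,
and the "US-ASCII" and "ASCII" spellings are guesses about what
non-glibc C libraries report (glibc reports "ANSI_X3.4-1968" for the
"C" locale):

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical predicate: does this codeset name denote 7-bit ASCII?
 * Only the "ANSI..." prefix matches what the patch checks; the other
 * two spellings are assumptions about non-glibc C libraries. */
static int codeset_is_ascii(const char *codeset)
{
	assert(codeset != NULL);
	return !strncmp(codeset, "ANSI", 4) ||
	       !strcmp(codeset, "US-ASCII") ||
	       !strcmp(codeset, "ASCII");
}

/* Returns 1 if the internal UTF-8 converter should be used. */
static int want_internal_utf8(void)
{
	const char *codeset;

	if (!setlocale(LC_ALL, ""))
		return 1;		/* locale setup failed */
	codeset = nl_langinfo(CODESET);
	if (!codeset || !*codeset)
		return 1;		/* no usable codeset name */
	return codeset_is_ascii(codeset);
}
```

Keeping the string comparison separate from the libc calls makes the
decision easy to test without touching the process locale.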

Here are the detailed notes:

- ntfs_ucs_to_utf8() should be tested to check whether it really
  honors outs_len correctly now (implemented, but not yet
  tested for that).

- has TODO comments (can be removed freely) just as a reminder
  that UTF-16 is what NTFS actually uses. UTF-16 has

- uses a global variable as a flag for whether the glibc ucs functions
  should be used as usual or whether the internal ucs-2<->utf8
  converter functions are used instead.

- I already allocate room for the worst case (6 bytes of UTF-8 output
  per UTF-16 character). It's just wasted space ATM, but to be
  sure not to forget it, I already use 6 instead of 3 when
  allocating the UTF-8 string.

- There is an #if 0 function in it; it was an exercise
  which could be expanded. One would have to try lots of
  characters to really know that the charset converter is
  not working though, so for now it's better to skip that
  approach. I just left it in temporarily as an example.

- Needs indentation correction.

- it uses nl_langinfo(), which is only provided by SUSv2
  and POSIX.1-2001, to decide on the fallback dynamically,
  but besides guessing around, I don't know any other way.

  I think a configure check for it would likely be in order
  if you would like to merge it.

  For older or reduced-feature systems which do not have
  locales installed but want these characters delivered
  anyway (I saw a request for that on the list), it could
  even be made build-time configurable whether to use the
  internal UTF-8 converter.
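
Regarding the buffer-size note in the list above: a UCS-2 code unit
never needs more than 3 bytes of UTF-8, and a UTF-16 surrogate pair
(two code units) needs 4 bytes, so 3 bytes per code unit is already
an upper bound. If the exact size is wanted, it can be computed in
one pass before allocating. The following is only a sketch; the
helper name utf8_len_of_ucs2() is hypothetical and the input is
assumed to be in CPU byte order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed to encode n UCS-2
 * code units (CPU byte order), not counting a terminating NUL. */
static size_t utf8_len_of_ucs2(const uint16_t *s, size_t n)
{
	size_t i, len = 0;

	assert(s != NULL || n == 0);
	for (i = 0; i < n; i++)
		len += s[i] < 0x80 ? 1 : s[i] < 0x800 ? 2 : 3;
	return len;
}
```

A caller would allocate utf8_len_of_ucs2(s, n) + 1 bytes instead of
a fixed multiple of the input length.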

BTW:
  It should not be hard to implement UTF-16LE, but the characters
  which would need it are rare (I have been doing UTF-8/Unicode
  things for some years now and never came across a real character
  in that range that was used anywhere).

  The conversion from UCS-4 should be rather straightforward,
  and I can also extend the functions in my patch to go to
  UTF-16 natively. However, I am not sure whether ntfs-3g has any
  built-in assumption that the number of elements
  in ntfschar[] is the number of characters in it. That's true
  for UCS-2, but UTF-16 is variable-length, although a rather
  simple variable-length encoding:

    http://en.wikipedia.org/wiki/UTF-16/UCS-2
    http://en.wikipedia.org/wiki/UTF-16

  Maybe older Windows NT versions before Windows 2000 didn't
  even use UTF-16: I have read that they only support UCS-2, but
  this source might be wrong.
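
To illustrate why the number of 16-bit elements and the number of
characters can differ in UTF-16, here is a minimal sketch of decoding
surrogate pairs. It is not part of the patch; the names
utf16_decode() and utf16_count_chars() are hypothetical, and the
input is assumed to be in CPU byte order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from UTF-16 code units.
 * Returns the number of units consumed (1 or 2), or 0 if the input
 * starts with an unpaired surrogate. */
static size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
	uint16_t hi;

	assert(s != NULL && cp != NULL && n > 0);
	hi = s[0];
	if (hi < 0xD800 || hi > 0xDFFF) {	/* ordinary BMP character */
		*cp = hi;
		return 1;
	}
	if (hi <= 0xDBFF && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
		/* high surrogate followed by low surrogate */
		*cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
				 (uint32_t)(s[1] - 0xDC00));
		return 2;
	}
	return 0;
}

/* Count characters (code points) in n code units.
 * Returns (size_t)-1 on malformed input. */
static size_t utf16_count_chars(const uint16_t *s, size_t n)
{
	size_t i = 0, chars = 0;

	while (i < n) {
		uint32_t cp;
		size_t used = utf16_decode(s + i, n - i, &cp);

		if (!used)
			return (size_t)-1;
		i += used;
		chars++;
	}
	return chars;
}
```

With the three code units 0x0041, 0xD83D, 0xDE00 the count is two
characters (U+0041 and U+1F600), which is the case any length
assumption in the callers would have to cope with.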

--- include/ntfs-3g/unistr.h
+++ include/ntfs-3g/unistr.h
@@ -26,6 +26,8 @@
 #include "types.h"
 #include "layout.h"

+extern int use_utf8;
+
 extern BOOL ntfs_names_are_equal(const ntfschar *s1, size_t s1_len,
                const ntfschar *s2, size_t s2_len, const IGNORE_CASE_BOOL ic,
                const ntfschar *upcase, const u32 upcase_size);
--- libntfs-3g/unistr.c
+++ libntfs-3g/unistr.c
@@ -47,6 +47,8 @@
 #include "logging.h"
 #include "misc.h"

+int use_utf8;
+
 /*
  * IMPORTANT
  * =========
@@ -378,6 +380,49 @@ int ntfs_file_values_compare(const FILE_
                        err_val, ic, upcase, upcase_len);
 }

+/*
+ * ntfs_ucs_to_utf8 - convert a little endian Unicode string to an UTF-8 string
+ * @ins:       input Unicode string buffer
+ * @ins_len:   length of input string in Unicode characters
+ * @outs:      on return contains the (allocated) output multibyte string
+ * @outs_len:  length of output buffer in bytes
+ * TODO: Replace this with a function which converts from UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
+ */
+int ntfs_ucs_to_utf8(const ntfschar *ins, const int ins_len,
+                       char *outs, int outs_len)
+{
+       int i = 0;
+       char *t = outs, *end = outs + outs_len;
+
+       ntfs_log_trace("%p (%d) -> %p (%d)\n", ins, ins_len, outs, outs_len);
+
+       for (; i < ins_len && ins[i]; i++) {
+           unsigned short c = le16_to_cpu(ins[i]);
+           if (c < 0x80) {
+               *t++ = c;
+               if (t == end)
+                       goto fail;
+           } else {
+              if (c & 0xf800) {
+                  if (t+3 >= end)
+                       goto fail;
+                  *t++ = 0xe0 | (c >> 12);
+                  *t++ = 0x80 | ((c >> 6) & 0x3f);
+              } else {
+                  if (t+2 >= end)
+                       goto fail;
+                  *t++ = (0xc0 | ((c >> 6) & 0x3f));
+              }
+              *t++ = 0x80 | (c & 0x3f);
+           }
+       }
+       *t = '\0';
+       return t - outs;
+fail:
+       return -1;
+}
+
 /**
  * ntfs_ucstombs - convert a little endian Unicode string to a multibyte string
  * @ins:       input Unicode string buffer
@@ -402,6 +446,8 @@ int ntfs_file_values_compare(const FILE_
  *                     sequence according to the current locale.
  *     ENAMETOOLONG    Destination buffer is too small for input string.
  *     ENOMEM          Not enough memory to allocate destination buffer.
+ * TODO: Replace this with a function which converts from UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
  */
 int ntfs_ucstombs(const ntfschar *ins, const int ins_len, char **outs,
                int outs_len)
@@ -425,11 +471,18 @@ int ntfs_ucstombs(const ntfschar *ins, c
                return -1;
        }
        if (!mbs) {
-               mbs_len = (ins_len + 1) * MB_CUR_MAX;
+               mbs_len = (ins_len + 1) * (use_utf8 ? 6 : MB_CUR_MAX);
                mbs = ntfs_malloc(mbs_len);
                if (!mbs)
                        return -1;
        }
+       if (use_utf8) {
+               int ret = ntfs_ucs_to_utf8(ins, ins_len, mbs, mbs_len);
+               if (ret >= 0)
+                       *outs = mbs;
+               return ret;
+       }
+
 #ifdef HAVE_MBSINIT
        memset(&mbstate, 0, sizeof(mbstate));
 #else
@@ -492,6 +546,88 @@ err_out:
        return -1;
 }

+/* This converts one UTF-8 sequence to cpu-endian UCS-2
+ * TODO: Replace this with a function which converts to UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
+*/
+int utf8_to_ucs2(wchar_t *wc, const char *s)
+{
+    unsigned int byte;
+    byte = *((unsigned char *)s);
+
+    if (byte == 0) {
+        *wc = (wchar_t) 0;
+        return 0;
+    } else if (byte < 0xC0) {
+        *wc = (wchar_t) byte;
+        return 1;
+    } else if (byte < 0xE0) {
+       if(strlen(s) < 2)
+               goto fail;
+        if ((s[1] & 0xC0) == 0x80) {
+            *wc = (wchar_t) (((byte & 0x1F) << 6) | (s[1] & 0x3F));
+            return 2;
+        } else
+               goto fail;
+    } else if (byte < 0xF0) {
+       if(strlen(s) < 3)
+               goto fail;
+        if (((s[1] & 0xC0) == 0x80) && ((s[2] & 0xC0) == 0x80)) {
+            *wc = (wchar_t) (((byte & 0x0F) << 12)
+                    | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
+           /* Surrogates range */
+           if((*wc >= 0xD800 && *wc <= 0xDFFF) ||
+              (*wc == 0xFFFE || *wc == 0xFFFF))
+                       goto fail;
+            return 3;
+        }
+    }
+fail:
+    return -1;
+}
+
+/**
+ * ntfs_utf8_to_ucs - convert a UTF-8 string to a UCS-2LE Unicode string
+ * @ins:       input multibyte string buffer
+ * @outs:      on return contains the (allocated) output Unicode string
+ * @outs_len:  length of output buffer in Unicode characters
+ * TODO: Replace this with a function which converts to UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
+ */
+static int ntfs_utf8_to_ucs(const char *ins, ntfschar **outs, int outs_len)
+{
+       const char *t = ins;
+       wchar_t wc;
+       ntfschar *outpos;
+
+       ntfs_log_trace("'%s' -> %p/%d\n", ins, *outs, outs_len);
+
+       if (!outs_len)
+               outs_len = NTFS_MAX_NAME_LEN + 1;
+
+       if (!*outs)
+               *outs = ntfs_malloc(outs_len * sizeof(ntfschar));
+       outpos = *outs;
+       while(1) {
+               int m  = utf8_to_ucs2(&wc, t);
+               if (m < 0) {
+                       errno = EILSEQ;
+                       goto fail;
+               }
+               *outpos++ = cpu_to_le16(wc);
+               if (m == 0)
+                       break;
+               if (!--outs_len) {
+                       errno = ENAMETOOLONG;
+                       goto fail;
+               }
+               t += m;
+       }
+    return --outpos - *outs;
+fail:
+    return -1;
+}
+
 /**
  * ntfs_mbstoucs - convert a multibyte string to a little endian Unicode string
  * @ins:       input multibyte string buffer
@@ -515,6 +651,8 @@ err_out:
  *                     string according to the current locale.
  *     ENAMETOOLONG    Destination buffer is too small for input string.
  *     ENOMEM          Not enough memory to allocate destination buffer.
+ * TODO: Replace this with a function which converts to UTF-16LE because
+ * NTFS uses UTF-16LE. UTF-16 supports more rare/unusual characters than UCS-2
  */
 int ntfs_mbstoucs(const char *ins, ntfschar **outs, int outs_len)
 {
@@ -536,6 +674,8 @@ int ntfs_mbstoucs(const char *ins, ntfsc
                errno = ENAMETOOLONG;
                return -1;
        }
+       if (use_utf8)
+               return ntfs_utf8_to_ucs(ins, outs, outs_len);
        /* Determine the size of the multi-byte string in bytes. */
        ins_size = strlen(ins);
        /* Determine the length of the multi-byte string. */
--- src/ntfs-3g.c
+++ src/ntfs-3g.c
@@ -70,6 +70,7 @@
 #include <getopt.h>
 #include <syslog.h>
 #include <sys/wait.h>
+#include <langinfo.h>

 #ifdef HAVE_SETXATTR
 #include <sys/xattr.h>
@@ -2564,6 +2565,29 @@ static void setup_logging(char *parsed_o
        ntfs_log_info("Mount options: %s\n", parsed_options);
 }

+#if 0 // only works for specific locales, would need to be extended:
+void test_unicodeconverter() {
+       ntfschar testchar[] = { 0xe4, 0 };
+       char *mbs = 0;
+       int ret;
+
+       if (ntfs_ucstombs(testchar, 1, &mbs, 0) <= 0)
+               use_utf8 = 1;
+       if (!ntfs_str2ucs(mbs, &ret))
+               use_utf8 = 1;
+       free(mbs);
+}
+#endif
+
+static void check_codeset() {
+       char *codeset = nl_langinfo(CODESET);
+       if (!codeset || !strncmp(codeset, "ANSI", 4)) {
+               ntfs_log_info("Locale invalid or has ANSI codeset: "
+                               "Using UTF-8 for international characters.\n");
+               use_utf8 = 1;
+       }
+}
+
 int main(int argc, char *argv[])
 {
        char *parsed_options = NULL;
@@ -2600,6 +2624,8 @@ int main(int argc, char *argv[])
                err = NTFS_VOLUME_SYNTAX_ERROR;
                goto err_out;
        }
+
+       check_codeset();

 #if defined(linux) || defined(__uClinux__)
        fstype = get_fuse_fstype();

