>Number:         151845
>Category:       kern
>Synopsis:       smbfs should be upgraded to support Unicode
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          update
>Submitter-Id:   current-users
>Arrival-Date:   Sun Oct 31 13:40:09 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Michael Meelis
>Release:        8.1-RELEASE
>Organization:
EasyBOW
>Environment:
8.1-RELEASE FreeBSD 8.1-RELEASE
>Description:
Windows stores all file names in UTF-16. When you copy files from Windows to
FreeBSD through a Samba server, Samba converts the file names from UTF-16 to
UTF-8. When you fetch the files back through Samba, the reverse conversion
occurs. This bidirectional conversion is correct and lossless. It is possible
because the Samba server uses a modern protocol with UTF-16 support. In this
direction everything works.
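
A minimal Python sketch of that round trip, as an illustration only. It is
not Samba's code; the function names and the sample file name are
assumptions made for the example.

```python
# Sketch: a name stored as UTF-16 (as on Windows) survives a round trip
# through UTF-8 (as on the FreeBSD side) without any loss.
def utf16_to_utf8(raw_utf16: bytes) -> bytes:
    """Windows -> FreeBSD direction."""
    return raw_utf16.decode("utf-16-le").encode("utf-8")


def utf8_to_utf16(raw_utf8: bytes) -> bytes:
    """FreeBSD -> Windows direction."""
    return raw_utf8.decode("utf-8").encode("utf-16-le")


name = "FrØya.html"                      # sample name, chosen for illustration
on_windows = name.encode("utf-16-le")    # bytes as Windows would store them
on_freebsd = utf16_to_utf8(on_windows)   # bytes as FreeBSD would store them
back_on_windows = utf8_to_utf16(on_freebsd)

round_trip_ok = back_on_windows == on_windows
print(round_trip_ok)
```

Both encodings cover all of Unicode, which is why no character can be lost
in either direction.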

When you want to copy files from FreeBSD to Windows, you first mount the
Windows share using mount_smbfs and smbfs.ko. But smbfs.ko (which does the
main work) supports only the old DOS-style protocol without Unicode support.
It uses a simple single-byte encoding. On the Windows side, the server
component converts the byte-coded characters into UTF-16 using a conversion
table. By default Windows (a beautiful "I know better" solution) uses CP437.
But in most cases the ISO8859-1 table is needed to represent the wide range of
file names. I checked this by analyzing your test archive. And this is not
all: I found many characters that cannot fit into ISO8859-1 because they come
from the WINDOWS-1252 table (I did this check too).
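
The difference between the three tables can be checked with a short Python
sketch. It uses Python's built-in codecs as stand-ins for the tables named
above; the sample characters are my own choice, not taken from the test
archive.

```python
# Sketch: which single-byte table can represent which character.
def encodable(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given table."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


samples = ["é", "Ø", "€", "“"]                     # illustrative characters
tables = ["cp437", "iso8859-1", "cp1252"]

# Map each sample character to the list of tables that contain it.
coverage = {ch: [t for t in tables if encodable(ch, t)] for ch in samples}

for ch, ok in coverage.items():
    print(repr(ch), ok)
```

In this sketch "Ø" is missing from CP437 only, while "€" and the curly quote
exist in WINDOWS-1252 only.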

So even if we use UTF-8 to CP437 conversion on FreeBSD, we lose most of the
additional characters on the FreeBSD side. If we use UTF-8 to WINDOWS-1252
conversion on FreeBSD, we lose nothing on the FreeBSD side, but we lose the
same characters as in the previous case on the Windows side.
We MUST change the conversion on the Windows side to the correct one: the
WINDOWS-1252 table must be used.
After this we can use UTF-8 to WINDOWS-1252 conversion on FreeBSD and get a
perfect result.
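
The effect of mismatched tables on the two sides can be sketched in Python.
This is a simplified model, not the SMB protocol: send_name is a
hypothetical helper that encodes with the client's table and decodes with
the server's table.

```python
# Sketch: the client encodes a name with one table, the server decodes the
# same bytes with another. Only matching tables preserve the name.
def send_name(name: str, client_table: str, server_table: str) -> str:
    wire_bytes = name.encode(client_table)     # what the client puts on the wire
    return wire_bytes.decode(server_table)     # what the server believes it got


original = "café.txt"                          # illustrative file name

mismatched = send_name(original, "cp1252", "cp437")
matched = send_name(original, "cp1252", "cp1252")

print(mismatched == original)
print(matched == original)
```

With WINDOWS-1252 on the client and CP437 on the server the accented letter
arrives as a different character; with WINDOWS-1252 on both sides the name
arrives intact.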

Additionally, I found that smbfs has an erroneous implementation of the
conversion from variable-length characters (UTF-8) to single-byte characters
(like WINDOWS-1252). This cannot be fixed without significant effort and
would take a long time to debug. But this is not a problem: we can use the
"iconv" option in rsync. Libiconv works perfectly with rsync.

To continue: all looks fine. But Windows can put (and does, I checked)
several control characters that are not defined in WINDOWS-1252 into file
names. These characters come from UTF-16 and convert to UTF-8 correctly, but
the conversion from UTF-8 to WINDOWS-1252 fails. So we need a patch for iconv
and libiconv that lets the conversion in libiconv work without errors
(otherwise rsync fails with "can't convert name" or a similar error).
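
A Python sketch of this kind of failure. Python's cp1252 codec stands in for
the WINDOWS-1252 table here, and U+0081 is an assumed example of such a
control character; the report does not say which characters were actually
seen.

```python
# Sketch: a control character that UTF-8 can carry but the single-byte
# table cannot, so a strict conversion raises an error.
name = "report\u0081.txt"          # hypothetical name with a C1 control char

as_utf8 = name.encode("utf-8")     # the UTF-16 -> UTF-8 step has no problem

try:
    name.encode("cp1252")          # strict conversion to the 8-bit table
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(strict_failed)
```

A strict converter has no byte to emit for such a character, so the caller
has to either abort or substitute something.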

I nearly lost my mind with smbfs and rsync. I made a new smbfs patch that
replaces unconvertible characters with "_". And rsync went crazy, copying the
same files to the Windows share again and again when run several times with
the same parameters. Funny. But bad.
This problem is connected with the whole conversion sequences:
on the first and later runs, while copying files to the Windows share:
localfs->rsync->smbfs->iconv->patch->windowsfs
on later runs, while finding files that need to be rsynced:
windowsfs->smbfs->iconv->rsync->localfs
Here a file named e.g. "FrØya.html" is converted to "Fr_ya.html" on the
Windows share, and the first rsync run completes without errors. But when
rsync runs a second time, it looks on the Windows share for "FrØya.html" but
finds only "Fr_ya.html" (rsync does not know about this lossy conversion
inside smbfs), so it copies this file again and again. Bug.
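
The repeated copying can be reproduced with a toy model in Python. This is
not how rsync or smbfs are implemented: lossy_store_name and naive_sync are
hypothetical stand-ins, and the share is modelled as a plain set of names.

```python
# Sketch: the storage layer renames files silently, the sync tool compares
# original names, so the renamed file is never found and is copied again.
def lossy_store_name(name: str) -> str:
    """Stand-in for the patched smbfs: unconvertible characters become '_'."""
    out = []
    for ch in name:
        try:
            ch.encode("cp437")
            out.append(ch)
        except UnicodeEncodeError:
            out.append("_")
    return "".join(out)


def naive_sync(local_names, share):
    """Copy every local name that is not visible on the share."""
    copied = []
    for name in local_names:
        if name not in share:                  # lookup uses the ORIGINAL name
            share.add(lossy_store_name(name))  # but the share stores another
            copied.append(name)
    return copied


local = ["FrØya.html", "index.html"]
share = set()
first_run = naive_sync(local, share)
second_run = naive_sync(local, share)
print(first_run, second_run, sorted(share))
```

The second run still copies "FrØya.html", because only "Fr_ya.html" exists
on the share.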

To fix this we need to leave the smbfs module untouched, add a new conversion
table to libiconv (so that the "_" replacement happens inside rsync), and use
rsync with the "iconv" option.

I added a new encoding, "CP437FIXED", which always converts invalid
characters to '_'.
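
A Python sketch of why moving the "_" replacement into the sync tool stops
the repeated copies. It only approximates the idea with Python's cp437 codec
and a custom error handler; it is not the CP437FIXED table from the patch,
and the names "underscore", to_share_name and sync_converted are made up for
the example.

```python
import codecs


def _underscore(exc):
    """Error handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        return ("_" * (exc.end - exc.start), exc.end)
    raise exc


codecs.register_error("underscore", _underscore)


def to_share_name(name: str) -> str:
    """Name as it will appear on the share after the lossy conversion."""
    return name.encode("cp437", errors="underscore").decode("cp437")


def sync_converted(local_names, share):
    """Sync that converts the name BEFORE looking it up on the share."""
    copied = []
    for name in local_names:
        target = to_share_name(name)
        if target not in share:
            share.add(target)
            copied.append(name)
    return copied


local = ["FrØya.html", "index.html"]
share = set()
first_run = sync_converted(local, share)
second_run = sync_converted(local, share)
print(first_run, second_run, sorted(share))
```

Because the tool now compares the converted name, the second run finds
"Fr_ya.html" and copies nothing. Two local names that differ only in
unconvertible characters would collide on the share; this sketch does not
handle that case.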

>How-To-Repeat:

>Fix:
smbfs should be upgraded to support Unicode. Until then, work with the
attached libiconv patch and the new CP437FIXED encoding. (The full patch and
test do not fit the 100 kB and .txt extension limits.)

Patch attached with submission follows:

--- libcharset/tools/all-charsets.orig  2009-06-21 11:17:33.000000000 +0000
+++ libcharset/tools/all-charsets       2010-06-29 00:11:59.000000000 +0000
@@ -21,7 +21,7 @@
     ISO-8859-7 | ISO-8859-8 | ISO-8859-9 | ISO-8859-13 | ISO-8859-14 | ISO-8859-15 | \
     KOI8-R | KOI8-U | KOI8-T | \
     CP437 | CP775 | CP850 | CP852 | CP855 | CP856 | CP857 | CP861 | CP862 | CP864 | CP865 | CP866 | CP869 | CP874 | CP922 | CP932 | CP943 | CP949 | CP950 | CP1046 | CP1124 | CP1125 | CP1129 | CP1131 | \
-    CP1250 | CP1251 | CP1252 | CP1253 | CP1254 | CP1255 | CP1256 | CP1257 | \
+    CP1250 | CP1251 | CP1252 | CP437FIXED | CP1253 | CP1254 | CP1255 | CP1256 | CP1257 | \
     GB2312 | EUC-JP | EUC-KR | EUC-TW | BIG5 | BIG5-HKSCS | GBK | GB18030 | SHIFT_JIS | JOHAB | \
     TIS-620 | VISCII | TCVN5712-1 | ARMSCII-8 | GEORGIAN-PS | PT154 | \
     HP-ROMAN8 | HP-ARABIC8 | HP-GREEK8 | HP-HEBREW8 | HP-TURKISH8 | HP-KANA8 | \
--- lib/flags.h.orig    2009-06-30 20:52:08.000000000 +0000
+++ lib/flags.h 2010-06-29 00:12:55.000000000 +0000
@@ -54,6 +54,7 @@
 #define ei_cp1250_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
 #define ei_cp1251_oflags (HAVE_QUOTATION_MARKS)
 #define ei_cp1252_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
+#define ei_cp437fixed_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
 #define ei_cp1253_oflags (HAVE_QUOTATION_MARKS)
 #define ei_cp1254_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
 #define ei_cp1255_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
--- lib/encodings.def.orig      2009-06-21 11:17:33.000000000 +0000
+++ lib/encodings.def   2010-06-29 00:14:55.000000000 +0000
@@ -459,6 +459,11 @@
             cp1252)
 #endif
 
+DEFENCODING(( "CP437FIXED",                 /* JDK 1.1 */
+            ),
+            cp437fixed,
+            { cp437fixed_mbtowc, NULL },      { cp437fixed_wctomb, NULL })
+
 DEFENCODING(( "CP1253",                 /* JDK 1.1 */
               "WINDOWS-1253",           /* IANA */
               "MS-GREEK",
--- lib/aliases.h.orig  2009-06-30 20:51:58.000000000 +0000
+++ lib/aliases.h       2010-06-29 14:42:30.000000000 +0000
@@ -32,11 +32,11 @@
 #line 1 "lib/aliases.gperf"
 struct alias { int name; unsigned int encoding_index; };
 
-#define TOTAL_KEYWORDS 346
+#define TOTAL_KEYWORDS 347
 #define MIN_WORD_LENGTH 2
 #define MAX_WORD_LENGTH 45
 #define MIN_HASH_VALUE 7
-#define MAX_HASH_VALUE 935
+#define MAX_HASH_VALUE 936
 /* maximum key range = 929, duplicates = 0 */
 
 #ifdef __GNUC__
@@ -46,24 +46,25 @@
 inline
 #endif
 #endif
+
 static unsigned int
 aliases_hash (register const char *str, register unsigned int len)
 {
   static const unsigned short asso_values[] =
     {
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936,  16,  62, 936,  73,   0,
-        5,   2,  47,   4,   1, 168,   8,  12, 357, 936,
-      936, 936, 936, 936, 936, 112, 123,   3,  14,  34,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937,  16,  62, 937,  73,   0,
+        5,   2,  47,   4,   1, 168,   8,  12, 357, 937,
+      937, 937, 937, 937, 937, 112, 123,   3,  14,  34,
        71, 142, 147,   0, 258,  79,  39, 122,   4,   0,
-      109, 936,  76,   1,  54, 147, 114, 180, 102,   3,
-       10, 936, 936, 936, 936,  34, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936
+      109, 937,  76,   1,  54, 147, 114, 180, 102,   3,
+       10, 937, 937, 937, 937,  34, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937
     };
   register int hval = len;
 
@@ -452,6 +453,7 @@
     char stringpool_str900[sizeof("BIG5-HKSCS:1999")];
     char stringpool_str908[sizeof("MACHEBREW")];
     char stringpool_str935[sizeof("BIG5-HKSCS:2004")];
+    char stringpool_str936[sizeof("CP437FIXED")];
   };
 static const struct stringpool_t stringpool_contents =
   {
@@ -800,7 +802,8 @@
     "CSHALFWIDTHKATAKANA",
     "BIG5-HKSCS:1999",
     "MACHEBREW",
-    "BIG5-HKSCS:2004"
+    "BIG5-HKSCS:2004",
+    "CP437FIXED"
   };
 #define stringpool ((const char *) &stringpool_contents)
 
@@ -1449,7 +1452,8 @@
     {(int)(long)&((struct stringpool_t *)0)->stringpool_str463, ei_cp1254},
 #line 73 "lib/aliases.gperf"
     {(int)(long)&((struct stringpool_t *)0)->stringpool_str464, ei_iso8859_3},
-    {-1},
+#line 358 "lib/aliases.gperf"
+    {(int)(long)&((struct stringpool_t *)0)->stringpool_str936, ei_cp437fixed},
 #line 89 "lib/aliases.gperf"
     {(int)(long)&((struct stringpool_t *)0)->stringpool_str466, ei_iso8859_5},
 #line 20 "lib/aliases.gperf"
--- lib/aliases.gperf.orig      2009-06-30 20:51:58.000000000 +0000
+++ lib/aliases.gperf   2010-06-29 01:12:13.000000000 +0000
@@ -355,3 +355,4 @@
 CSISO2022KR, ei_iso2022_kr
 CHAR, ei_local_char
 WCHAR_T, ei_local_wchar_t
+CP437FIXED, ei_cp437fixed
--- lib/converters.h.orig       2009-06-21 11:17:33.000000000 +0000
+++ lib/converters.h    2010-06-29 00:20:00.000000000 +0000
@@ -160,6 +160,7 @@
 #include "cp1250.h"
 #include "cp1251.h"
 #include "cp1252.h"
+#include "cp437fixed.h"
 #include "cp1253.h"
 #include "cp1254.h"
 #include "cp1255.h"
--- tests/Makefile.in.orig      2010-06-29 00:20:34.000000000 +0000
+++ tests/Makefile.in   2010-06-29 00:20:27.000000000 +0000
@@ -68,6 +68,7 @@
        $(srcdir)/check-stateless $(srcdir) CP1250
        $(srcdir)/check-stateless $(srcdir) CP1251
        $(srcdir)/check-stateless $(srcdir) CP1252
+       $(srcdir)/check-stateless $(srcdir) CP437FIXED
        $(srcdir)/check-stateless $(srcdir) CP1253
        $(srcdir)/check-stateless $(srcdir) CP1254
        $(srcdir)/check-stateless $(srcdir) CP1255
--- tools/Makefile.orig 2010-06-27 18:09:49.000000000 +0000
+++ tools/Makefile      2010-06-29 00:38:07.000000000 +0000
@@ -26,6 +26,7 @@
  cp1250.h \
  cp1251.h \
  cp1252.h \
+ cp437fixed.h \
  cp1253.h \
  cp1254.h \
  cp1255.h \
@@ -191,6 +192,9 @@
 cp1252.h : $(TABLESDIR)/unicode.org-mappings/VENDORS/MICSFT/WINDOWS/CP1252.TXT 8bit_tab_to_h
        ./8bit_tab_to_h CP1252 cp1252 < $<
 
+cp437fixed.h : $(TABLESDIR)/unicode.org-mappings/VENDORS/MICSFT/WINDOWS/CP437FIXED.TXT 8bit_tab_to_h
+       ./8bit_tab_to_h CP437FIXED cp437fixed < $<
+
 cp1253.h : $(TABLESDIR)/unicode.org-mappings/VENDORS/MICSFT/WINDOWS/CP1253.TXT 8bit_tab_to_h
        ./8bit_tab_to_h CP1253 cp1253 < $<
 


>Release-Note:
>Audit-Trail:
>Unformatted: