Re: [ntfs-3g-devel] Is there a way to ignore "Invalid or incomplete multibyte or wide character"

Erik Larsson Tue, 12 Apr 2016 08:19:49 -0700

Hi Jean-Pierre,

On 2016-04-12 13:37, Jean-Pierre André wrote:

Erik Larsson wrote:

In that case maybe we should unify the two #defines into something like
"ALLOW_BROKEN_UNICODE"?

Probably yes.


In https://en.wikipedia.org/wiki/Specials_(Unicode_block)
the Unicode points U+FFFE and U+FFFF are qualified as
"noncharacters", so I guess they should not be present in
a file name. However we do not want to get stuck if this
happens, this is what you have proposed ALLOW_BROKEN_UNICODE
for, and it does not change the current behavior.
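
As a rough illustration of the policy discussed above, the range check for a
decoded 3-byte UTF-8 value could look like the sketch below. It mirrors the
check in utf8_to_unicode() from the attached patch 2, but the function name
three_byte_value_ok() is hypothetical and not part of libntfs-3g.

```c
#include <assert.h>
#include <stdint.h>

#ifndef ALLOW_BROKEN_UNICODE
#define ALLOW_BROKEN_UNICODE 1	/* tolerant by default, as in the patch */
#endif

/* Hypothetical predicate (not in libntfs-3g): may a value decoded from a
 * 3-byte UTF-8 sequence be accepted? */
static int three_byte_value_ok(uint32_t wc)
{
#if ALLOW_BROKEN_UNICODE
	/* Tolerant: surrogates 0xD800..0xDFFF and the noncharacters
	 * U+FFFE and U+FFFF are let through. */
	return (wc >= 0x800) && (wc <= 0xFFFF);
#else
	/* Strict: reject surrogates and the two noncharacters. */
	return ((wc >= 0x800) && (wc <= 0xD7FF))
	    || ((wc >= 0xE000) && (wc <= 0xFFFD));
#endif
}
```

Building with -DALLOW_BROKEN_UNICODE=0 selects the strict branch, in which
U+FFFE and U+FFFF are rejected.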

I made this a separate patch... see patch 2 in attachments (patch 1 should be the same as before).

I was a bit confused because NOREVBOM was already set to 0. To reject the BOM code points in NTFS UTF-16 strings it would be set to 1, right? Thought the comment beside NOREVBOM said that you rejected the BOM code points, which is the opposite of what the code was actually doing...?

Anyway, please review patch 2 carefully to make sure I didn't misunderstand anything.

Or we just keep them separate but make NOREVBOM 0 by default.

IMHO there is no real need, let us keep it simple.


Sounds good. If you agree with these two patches I will push them to git.

Best regards,

- Erik

On 2016-04-08 08:49, Jean-Pierre André wrote:

Hi Erik,

This is good to me. I will just suggest forcing NOREVBOM
to zero when ALLOW_BROKEN_SURROGATES is set.

NOREVBOM exists only because I could not find a
reference for how to process a BOM in a file name. If we
want to support bad codes, BOMs must not be rejected.

Regards

Jean-Pierre

Erik Larsson wrote:

Hi Jean-Pierre,

On 2016-04-07 16:52, Jean-Pierre André wrote:

Erik Larsson wrote:

Hi,

On 2016-04-06 19:22, Jean-Pierre André wrote:

Erik Larsson wrote:


[...]

I have a proposal that would enable accessing these broken files in
ntfs-3g and the progs. The proposal involves encoding broken surrogate
UTF-16 units into their own separate 3-byte UTF-8 sequences. This is
sometimes referred to by the acronym WTF-8 (see:
https://en.wikipedia.org/wiki/UTF-8#WTF-8 ).

The effect is that these files aren't ignored as in the previous
proposed patch but are included in the listing and can be looked up as
any other file since encoding broken UTF-16 to WTF-8 and then back to
broken UTF-16 is lossless, though the UTF-8 byte sequences returned to
the user aren't fully Unicode compliant.
However I think this is the best we can do without starting to
manufacture fake file names for these entries with all that
complexity.
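
The lossless round trip described above can be sketched in isolation as
follows. This is a minimal illustration, not the code from the patch: the
names wtf8_encode() and wtf8_decode() are hypothetical, the decoder assumes
its input came from the encoder and does no validation, and the caller is
assumed to provide large enough output buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode n UTF-16 units as UTF-8, writing unpaired
 * surrogates as their own 3-byte sequences ("WTF-8").
 * out must hold at least 3 * n bytes. Returns the bytes written. */
static size_t wtf8_encode(const uint16_t *in, size_t n, unsigned char *out)
{
	size_t i, o = 0;

	for (i = 0; i < n; i++) {
		uint32_t c = in[i];

		/* Combine a well-formed high/low surrogate pair. */
		if (c >= 0xd800 && c < 0xdc00 && i + 1 < n
		    && in[i + 1] >= 0xdc00 && in[i + 1] < 0xe000) {
			c = 0x10000 + ((c - 0xd800) << 10)
				+ (uint32_t)(in[i + 1] - 0xdc00);
			i++;
		}
		if (c < 0x80) {
			out[o++] = (unsigned char)c;
		} else if (c < 0x800) {
			out[o++] = (unsigned char)(0xc0 | (c >> 6));
			out[o++] = (unsigned char)(0x80 | (c & 0x3f));
		} else if (c < 0x10000) {
			/* Also covers unpaired surrogates 0xd800..0xdfff. */
			out[o++] = (unsigned char)(0xe0 | (c >> 12));
			out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
			out[o++] = (unsigned char)(0x80 | (c & 0x3f));
		} else {
			out[o++] = (unsigned char)(0xf0 | (c >> 18));
			out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
			out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
			out[o++] = (unsigned char)(0x80 | (c & 0x3f));
		}
	}
	return o;
}

/* Hypothetical helper: decode bytes produced by wtf8_encode() back into
 * UTF-16 units. No validation is done. out must hold at least n units.
 * Returns the units written. */
static size_t wtf8_decode(const unsigned char *in, size_t n, uint16_t *out)
{
	size_t i = 0, o = 0;

	while (i < n) {
		unsigned char b = in[i];
		uint32_t c;

		if (b < 0x80) {
			c = b;
			i += 1;
		} else if (b < 0xe0) {
			c = ((uint32_t)(b & 0x1f) << 6) | (in[i + 1] & 0x3f);
			i += 2;
		} else if (b < 0xf0) {
			c = ((uint32_t)(b & 0x0f) << 12)
			    | ((uint32_t)(in[i + 1] & 0x3f) << 6)
			    | (in[i + 2] & 0x3f);
			i += 3;
		} else {
			c = ((uint32_t)(b & 0x07) << 18)
			    | ((uint32_t)(in[i + 1] & 0x3f) << 12)
			    | ((uint32_t)(in[i + 2] & 0x3f) << 6)
			    | (in[i + 3] & 0x3f);
			i += 4;
		}
		if (c >= 0x10000) {
			/* Split back into a surrogate pair. */
			c -= 0x10000;
			out[o++] = (uint16_t)(0xd800 + (c >> 10));
			out[o++] = (uint16_t)(0xdc00 + (c & 0x3ff));
		} else {
			/* Lone surrogates come back unchanged. */
			out[o++] = (uint16_t)c;
		}
	}
	return o;
}
```

The round trip stays unambiguous because the encoder always combines a high
surrogate that is directly followed by a low one, so two adjacent 3-byte
surrogate sequences in that order never appear in its output.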

Please review the attached patch.


From your proposal, you apparently only have to fix the
processing of an isolated surrogate at the end of utf16
string.
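
The end-of-string case mentioned above can be illustrated with a standalone
size computation. This is only a sketch in the spirit of utf16_to_utf8_size()
from the patch: the name wtf8_size() is hypothetical, and the function has
none of the length limits or error paths of the real one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode n UTF-16 units
 * when unpaired surrogates are written as 3-byte sequences. */
static size_t wtf8_size(const uint16_t *in, size_t n)
{
	size_t i, count = 0;
	int surrog = 0;	/* a high surrogate is waiting for its partner */

	for (i = 0; i < n; i++) {
		uint16_t c = in[i];

		if (surrog) {
			surrog = 0;
			if (c >= 0xdc00 && c < 0xe000) {
				count += 4;	/* well-formed pair */
				continue;
			}
			/* The pending high surrogate was unpaired;
			 * c itself is classified below. */
			count += 3;
		}
		if (c < 0x80)
			count += 1;
		else if (c < 0x800)
			count += 2;
		else if (c >= 0xd800 && c < 0xdc00)
			surrog = 1;
		else
			count += 3;	/* includes lone low surrogates */
	}
	if (surrog)
		count += 3;	/* string ends with an isolated surrogate */
	return count;
}
```

Without the final check, a string ending in a high surrogate would be counted
three bytes short.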

Thanks, I missed this case. I also noticed that you missed wrapping this
in #if/#else/#endif.
See attachments for my updated v2 patch which does this as well.

With this fix, my test of all possibilities appears to
run fine.


Great.

Best regards,

- Erik

From d9c61dd60ec484909f70b7a916ada3a93af94b60 Mon Sep 17 00:00:00 2001
From: Erik Larsson <[email protected]>
Date: Fri, 8 Apr 2016 05:39:48 +0200
Subject: [PATCH 1/2] unistr.c: Enable encoding broken UTF-16 into broken
 UTF-8, A.K.A. WTF-8.

Windows filenames may contain invalid UTF-16 sequences (specifically
broken surrogate pairs), which cannot be converted to UTF-8 if we do
strict conversion.

This patch enables encoding broken UTF-16 into similarly broken UTF-8 by
encoding any surrogate character that don't have a match into a separate
3-byte UTF-8 sequence.

This is "sort of" valid UTF-8, but not valid Unicode since the code
points used for surrogate pair encoding are not supposed to occur in a
valid Unicode string... but on the other hand the source UTF-16 data is
also broken, so we aren't really making things any worse.

This format is sometimes referred to as WTF-8 (Wobbly Translation
Format, 8-bit encoding) and is a common solution to represent broken
UTF-16 as UTF-8.

It is a lossless round-trip conversion, i.e converting from broken
UTF-16 to "WTF-8" and back to UTF-16 yields the same broken UTF-16
sequence. Because of this property it enables accessing these files
by filename through ntfs-3g and the ntfsprogs (e.g. ls -la works as
expected).

To disable this behaviour you can pass the preprocessor/compiler flag
'-DALLOW_BROKEN_SURROGATES=0' when building ntfs-3g.
---
 libntfs-3g/unistr.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/libntfs-3g/unistr.c b/libntfs-3g/unistr.c
index 7f278cd..71802aa 100644
--- a/libntfs-3g/unistr.c
+++ b/libntfs-3g/unistr.c
@@ -61,6 +61,11 @@
 
 #define NOREVBOM 0  /* JPA rejecting U+FFFE and U+FFFF, open to debate */
 
+#ifndef ALLOW_BROKEN_SURROGATES
+/* Erik allowing broken UTF-16 surrogate pairs by default, open to debate. */
+#define ALLOW_BROKEN_SURROGATES 1
+#endif /* !defined(ALLOW_BROKEN_SURROGATES) */
+
 /*
  * IMPORTANT
  * =========
@@ -462,8 +467,22 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 			if ((c >= 0xdc00) && (c < 0xe000)) {
 				surrog = FALSE;
 				count += 4;
-			} else 
+			} else {
+#if ALLOW_BROKEN_SURROGATES
+				/* The first UTF-16 unit of a surrogate pair has
+				 * a value between 0xd800 and 0xdc00. It can be
+				 * encoded as an individual UTF-8 sequence if we
+				 * cannot combine it with the next UTF-16 unit
+				 * unit as a surrogate pair. */
+				surrog = FALSE;
+				count += 3;
+
+				--i;
+				continue;
+#else
 				goto fail;
+#endif /* ALLOW_BROKEN_SURROGATES */
+			}
 		} else
 			if (c < 0x80)
 				count++;
@@ -473,6 +492,10 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 				count += 3;
 			else if (c < 0xdc00)
 				surrog = TRUE;
+#if ALLOW_BROKEN_SURROGATES
+			else if (c < 0xe000)
+				count += 3;
+#endif /* ALLOW_BROKEN_SURROGATES */
 #if NOREVBOM
 			else if ((c >= 0xe000) && (c < 0xfffe))
 #else
@@ -487,7 +510,11 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 		}
 	}
 	if (surrog) 
+#if ALLOW_BROKEN_SURROGATES
+		count += 3; /* ending with a single surrogate */
+#else
 		goto fail;
+#endif /* ALLOW_BROKEN_SURROGATES */
 
 	ret = count;
 out:
@@ -548,8 +575,24 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 				*t++ = 0x80 + ((c >> 6) & 15) + ((halfpair & 3) << 4);
 				*t++ = 0x80 + (c & 63);
 				halfpair = 0;
-			} else 
+			} else {
+#if ALLOW_BROKEN_SURROGATES
+				/* The first UTF-16 unit of a surrogate pair has
+				 * a value between 0xd800 and 0xdc00. It can be
+				 * encoded as an individual UTF-8 sequence if we
+				 * cannot combine it with the next UTF-16 unit
+				 * unit as a surrogate pair. */
+				*t++ = 0xe0 | (halfpair >> 12);
+				*t++ = 0x80 | ((halfpair >> 6) & 0x3f);
+				*t++ = 0x80 | (halfpair & 0x3f);
+				halfpair = 0;
+
+				--i;
+				continue;
+#else
 				goto fail;
+#endif /* ALLOW_BROKEN_SURROGATES */
+			}
 		} else if (c < 0x80) {
 			*t++ = c;
 	    	} else {
@@ -562,6 +605,13 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 		        	*t++ = 0x80 | (c & 0x3f);
 			} else if (c < 0xdc00)
 				halfpair = c;
+#if ALLOW_BROKEN_SURROGATES
+			else if (c < 0xe000) {
+				*t++ = 0xe0 | (c >> 12);
+				*t++ = 0x80 | ((c >> 6) & 0x3f);
+				*t++ = 0x80 | (c & 0x3f);
+			}
+#endif /* ALLOW_BROKEN_SURROGATES */
 			else if (c >= 0xe000) {
 				*t++ = 0xe0 | (c >> 12);
 				*t++ = 0x80 | ((c >> 6) & 0x3f);
@@ -570,6 +620,13 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 				goto fail;
 	        }
 	}
+#if ALLOW_BROKEN_SURROGATES
+	if (halfpair) { /* ending with a single surrogate */
+		*t++ = 0xe0 | (halfpair >> 12);
+		*t++ = 0x80 | ((halfpair >> 6) & 0x3f);
+		*t++ = 0x80 | (halfpair & 0x3f);
+	}
+#endif /* ALLOW_BROKEN_SURROGATES */
 	*t = '\0';
 	
 #if defined(__APPLE__) || defined(__DARWIN__)
@@ -693,10 +750,16 @@ static int utf8_to_unicode(u32 *wc, const char *s)
 			/* Check valid ranges */
 #if NOREVBOM
 			if (((*wc >= 0x800) && (*wc <= 0xD7FF))
+#if ALLOW_BROKEN_SURROGATES
+			  || ((*wc >= 0xD800) && (*wc <= 0xDFFF))
+#endif /* ALLOW_BROKEN_SURROGATES */
 			  || ((*wc >= 0xe000) && (*wc <= 0xFFFD)))
 				return 3;
 #else
 			if (((*wc >= 0x800) && (*wc <= 0xD7FF))
+#if ALLOW_BROKEN_SURROGATES
+			  || ((*wc >= 0xD800) && (*wc <= 0xDFFF))
+#endif /* ALLOW_BROKEN_SURROGATES */
 			  || ((*wc >= 0xe000) && (*wc <= 0xFFFF)))
 				return 3;
 #endif
-- 
2.4.9 (Apple Git-60)

From f0370bfa9c47575d4e47c94e443aa91983683a43 Mon Sep 17 00:00:00 2001
From: Erik Larsson <[email protected]>
Date: Tue, 12 Apr 2016 17:02:40 +0200
Subject: [PATCH 2/2] unistr.c: Unify the two defines NOREVBOM and
 ALLOW_BROKEN_SURROGATES.

In the mailing list discussion we came to the conclusion that there
doesn't seem to be any reason to keep these declarations separate since
they address the same issue, namely libntfs-3g's tolerance for bad
Unicode data in filenames and other UTF-16 strings in the file system,
so merge the two defines into the new define ALLOW_BROKEN_UNICODE.
---
 libntfs-3g/unistr.c | 54 +++++++++++++++++++++++------------------------------
 1 file changed, 23 insertions(+), 31 deletions(-)

diff --git a/libntfs-3g/unistr.c b/libntfs-3g/unistr.c
index 71802aa..753acc0 100644
--- a/libntfs-3g/unistr.c
+++ b/libntfs-3g/unistr.c
@@ -59,12 +59,11 @@
 #include "logging.h"
 #include "misc.h"
 
-#define NOREVBOM 0  /* JPA rejecting U+FFFE and U+FFFF, open to debate */
-
-#ifndef ALLOW_BROKEN_SURROGATES
-/* Erik allowing broken UTF-16 surrogate pairs by default, open to debate. */
-#define ALLOW_BROKEN_SURROGATES 1
-#endif /* !defined(ALLOW_BROKEN_SURROGATES) */
+#ifndef ALLOW_BROKEN_UNICODE
+/* Erik allowing broken UTF-16 surrogate pairs and U+FFFE and U+FFFF by default,
+ * open to debate. */
+#define ALLOW_BROKEN_UNICODE 1
+#endif /* !defined(ALLOW_BROKEN_UNICODE) */
 
 /*
  * IMPORTANT
@@ -468,7 +467,7 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 				surrog = FALSE;
 				count += 4;
 			} else {
-#if ALLOW_BROKEN_SURROGATES
+#if ALLOW_BROKEN_UNICODE
 				/* The first UTF-16 unit of a surrogate pair has
 				 * a value between 0xd800 and 0xdc00. It can be
 				 * encoded as an individual UTF-8 sequence if we
@@ -481,7 +480,7 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 				continue;
 #else
 				goto fail;
-#endif /* ALLOW_BROKEN_SURROGATES */
+#endif /* ALLOW_BROKEN_UNICODE */
 			}
 		} else
 			if (c < 0x80)
@@ -492,15 +491,13 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 				count += 3;
 			else if (c < 0xdc00)
 				surrog = TRUE;
-#if ALLOW_BROKEN_SURROGATES
+#if ALLOW_BROKEN_UNICODE
 			else if (c < 0xe000)
 				count += 3;
-#endif /* ALLOW_BROKEN_SURROGATES */
-#if NOREVBOM
-			else if ((c >= 0xe000) && (c < 0xfffe))
-#else
 			else if (c >= 0xe000)
-#endif
+#else
+			else if ((c >= 0xe000) && (c < 0xfffe))
+#endif /* ALLOW_BROKEN_UNICODE */
 				count += 3;
 			else 
 				goto fail;
@@ -510,11 +507,11 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 		}
 	}
 	if (surrog) 
-#if ALLOW_BROKEN_SURROGATES
+#if ALLOW_BROKEN_UNICODE
 		count += 3; /* ending with a single surrogate */
 #else
 		goto fail;
-#endif /* ALLOW_BROKEN_SURROGATES */
+#endif /* ALLOW_BROKEN_UNICODE */
 
 	ret = count;
 out:
@@ -576,7 +573,7 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 				*t++ = 0x80 + (c & 63);
 				halfpair = 0;
 			} else {
-#if ALLOW_BROKEN_SURROGATES
+#if ALLOW_BROKEN_UNICODE
 				/* The first UTF-16 unit of a surrogate pair has
 				 * a value between 0xd800 and 0xdc00. It can be
 				 * encoded as an individual UTF-8 sequence if we
@@ -591,7 +588,7 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 				continue;
 #else
 				goto fail;
-#endif /* ALLOW_BROKEN_SURROGATES */
+#endif /* ALLOW_BROKEN_UNICODE */
 			}
 		} else if (c < 0x80) {
 			*t++ = c;
@@ -605,13 +602,13 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 		        	*t++ = 0x80 | (c & 0x3f);
 			} else if (c < 0xdc00)
 				halfpair = c;
-#if ALLOW_BROKEN_SURROGATES
+#if ALLOW_BROKEN_UNICODE
 			else if (c < 0xe000) {
 				*t++ = 0xe0 | (c >> 12);
 				*t++ = 0x80 | ((c >> 6) & 0x3f);
 				*t++ = 0x80 | (c & 0x3f);
 			}
-#endif /* ALLOW_BROKEN_SURROGATES */
+#endif /* ALLOW_BROKEN_UNICODE */
 			else if (c >= 0xe000) {
 				*t++ = 0xe0 | (c >> 12);
 				*t++ = 0x80 | ((c >> 6) & 0x3f);
@@ -620,13 +617,13 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 				goto fail;
 	        }
 	}
-#if ALLOW_BROKEN_SURROGATES
+#if ALLOW_BROKEN_UNICODE
 	if (halfpair) { /* ending with a single surrogate */
 		*t++ = 0xe0 | (halfpair >> 12);
 		*t++ = 0x80 | ((halfpair >> 6) & 0x3f);
 		*t++ = 0x80 | (halfpair & 0x3f);
 	}
-#endif /* ALLOW_BROKEN_SURROGATES */
+#endif /* ALLOW_BROKEN_UNICODE */
 	*t = '\0';
 	
 #if defined(__APPLE__) || defined(__DARWIN__)
@@ -748,21 +745,16 @@ static int utf8_to_unicode(u32 *wc, const char *s)
 			    | ((u32)(s[1] & 0x3F) << 6)
 			    | ((u32)(s[2] & 0x3F));
 			/* Check valid ranges */
-#if NOREVBOM
+#if ALLOW_BROKEN_UNICODE
 			if (((*wc >= 0x800) && (*wc <= 0xD7FF))
-#if ALLOW_BROKEN_SURROGATES
 			  || ((*wc >= 0xD800) && (*wc <= 0xDFFF))
-#endif /* ALLOW_BROKEN_SURROGATES */
-			  || ((*wc >= 0xe000) && (*wc <= 0xFFFD)))
+			  || ((*wc >= 0xe000) && (*wc <= 0xFFFF)))
 				return 3;
 #else
 			if (((*wc >= 0x800) && (*wc <= 0xD7FF))
-#if ALLOW_BROKEN_SURROGATES
-			  || ((*wc >= 0xD800) && (*wc <= 0xDFFF))
-#endif /* ALLOW_BROKEN_SURROGATES */
-			  || ((*wc >= 0xe000) && (*wc <= 0xFFFF)))
+			  || ((*wc >= 0xe000) && (*wc <= 0xFFFD)))
 				return 3;
-#endif
+#endif /* ALLOW_BROKEN_UNICODE */
 		}
 		goto fail;
 					/* four-byte */
-- 
2.4.9 (Apple Git-60)


_______________________________________________
ntfs-3g-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ntfs-3g-devel
