Change in vdsm[master]: guestIF: Rework XML character filtering

vfeenstr Wed, 10 Jul 2013 01:02:31 -0700

Vinzenz Feenstra has uploaded a new change for review.

Change subject: guestIF: Rework XML character filtering
......................................................................


guestIF: Rework XML character filtering

Data from the guest agent is essentially untrusted, and may contain
characters which cannot be included in XML such as vdsm's XML-RPC
responses.

Commits b5db7c60b61ef6143d3be27c58658ac94c858626 (guestIF: Additional
object strings filtering) and aef1a7d288d0e5f82ebc4fea3e5f0155e9a7f541
(agent: XML Character filtering improvement) are attempts to correct this
filtering, but they are conceptually incorrect and can both fail to
filter characters which need to be filtered, and filter characters which
do not need to be filtered.

aef1a7d288d0e5f82ebc4fea3e5f0155e9a7f541 added code to run the XML
filtering before UTF8 decoding the message from the guest.  This is
incorrect, since the characters permitted in XML files are defined
in terms of Unicode code points, not individual byte values.  This means
that we currently filter out perfectly valid characters such as \u2122
(trademark symbol) because its UTF8 encoding (\xe2\x84\xa2) contains a byte
(0x84) with the same value as the ordinal of one of the XML restricted
characters.  Worse, filtering that byte gives a bytestring which is
no longer valid UTF8, and so will lead to a UnicodeError exception on
decode.

Since that filtering also takes place before UTF8 and JSON decoding, those
steps may still result in the construction of invalid XML characters.

b5db7c60b61ef6143d3be27c58658ac94c858626 partially corrects this by
(re-)adding XML filtering after the UTF8 and JSON decode as well.  This
is conceptually correct, but the filter is not thorough enough.  Currently
the code only filters out characters in set XML defines as
"RestrictedChar".  In XML valid characters are those in the set "Char"
which are not in "RestrictedChar", and the filtering currently leaves in
characters which are neither in "Char" nor "RestrictedChar" - this includes
the surrogate pair code points, \ufffe and \uffff.  \uffff at least will
cause the XML parser in ovirt to choke.

This patch addresses all the above problems.  Additionally it now replaces
bad characters with the official Unicode replacement character '\ufffd',
instead of '?'.  It also adds the 'replace' error handling on the UTF8
decode, so that invalidly encoded UTF8 strings from the guest will not
cause an exception in VDSM.

Change-Id: I6404815c60c14f1499375609aad95e60d5025f80
Signed-off-by: David Gibson <da...@gibson.dropbear.id.au>
---
M tests/guestIFTests.py
M vdsm/guestIF.py
2 files changed, 53 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/52/16652/1

diff --git a/tests/guestIFTests.py b/tests/guestIFTests.py
index 7a392fd..94efe92 100644
--- a/tests/guestIFTests.py
+++ b/tests/guestIFTests.py
@@ -84,36 +84,28 @@
 
 class TestGuestIF(TestCaseBase):
     def testfilterXmlChars(self):
-        ALL_LEGAL = "Hello World"
+        ALL_LEGAL = u"Hello World"
         self.assertEqual(ALL_LEGAL, guestIF._filterXmlChars(ALL_LEGAL))
-        TM = u"\u2122".encode('utf8')
-        self.assertNotEqual(TM, guestIF._filterXmlChars(TM))
+        TM = u"\u2122"
+        self.assertEqual(TM, guestIF._filterXmlChars(TM))
         invalid = u"\u0000"
-        self.assertEqual(u'?', guestIF._filterXmlChars(invalid))
-        invalid2 = "\x00"
-        self.assertEqual('?',  guestIF._filterXmlChars(invalid2))
+        self.assertEqual(u'\ufffd', guestIF._filterXmlChars(invalid))
+        invalid2 = u"\uffff"
+        self.assertEqual(u'\ufffd',  guestIF._filterXmlChars(invalid2))
+        invalid3 = u"\ufffe"
+        self.assertEqual(u'\ufffd',  guestIF._filterXmlChars(invalid3))
+        invalid4 = u"\ud800"
+        self.assertEqual(u'\ufffd',  guestIF._filterXmlChars(invalid4))
+        invalid5 = u"\udc79"
+        self.assertEqual(u'\ufffd',  guestIF._filterXmlChars(invalid5))
 
     def test_filterObject(self):
-        ILLEGAL_DATA = {"foo": "\x00data\x00test"}
-        LEGAL_DATA = {"foo": "?data?test"}
-        EXPECTED_DATA = {"foo": "?data?test"}
+        ILLEGAL_DATA = {u"foo": u"\x00data\x00test\uffff\ufffe\ud800\udc79"}
+        LEGAL_DATA = {u"foo": u"?data?test\U00010000"}
+        EXPECTED_DATA = {
+            u"foo": u"\ufffddata\ufffdtest\ufffd\ufffd\ufffd\ufffd"}
         self.assertEqual(EXPECTED_DATA, guestIF._filterObject(ILLEGAL_DATA))
-        self.assertEqual(EXPECTED_DATA, guestIF._filterObject(LEGAL_DATA))
-
-    def test_StringAndObjectFiltering(self):
-        ILLEGAL_DATA = json.dumps({"foo": "\x00data\x00test"})
-        LEGAL_DATA = json.dumps({"foo": "?data?test"})
-        EXPECTED_DATA = {"foo": "?data?test"}
-
-        filtered = guestIF._filterXmlChars(ILLEGAL_DATA)
-        parsed = json.loads(filtered.decode('utf-8'))
-        filt_obj = guestIF._filterObject(parsed)
-        self.assertEqual(filt_obj, EXPECTED_DATA)
-
-        filtered = guestIF._filterXmlChars(LEGAL_DATA)
-        parsed = json.loads(filtered.decode('utf-8'))
-        filt_obj = guestIF._filterObject(parsed)
-        self.assertEqual(filt_obj, EXPECTED_DATA)
+        self.assertEqual(LEGAL_DATA, guestIF._filterObject(LEGAL_DATA))
 
     def test_handleMessage(self):
         logging.TRACE = 5
diff --git a/vdsm/guestIF.py b/vdsm/guestIF.py
index 07f9c20..dae4c3b 100644
--- a/vdsm/guestIF.py
+++ b/vdsm/guestIF.py
@@ -24,7 +24,9 @@
 import errno
 import json
 import supervdsm
+import unicodedata
 
+__REPLACEMENT_CHAR = u'\ufffd'
 __RESTRICTED_CHARS = set(range(8 + 1)). \
     union(set(range(0xB, 0xC + 1))). \
     union(set(range(0xE, 0x1F + 1))). \
@@ -34,20 +36,38 @@
 
 def _filterXmlChars(u):
     """
-    Filter out restarted xml chars from unicode string. Not using
-    Python's xmlcharrefreplace because it accepts '\x01', which
-    the spec frown upon.
+    The set of characters allowed in XML documents is described in
+    http://www.w3.org/TR/xml11/#charsets
 
-    Set taken from http://www.w3.org/TR/xml11/#NT-RestrictedChar
+    "Char" is defined as any unicode character except the surrogate blocks,
+    \ufffe and \uffff.
+    "RestrictedChar" is defiend as the code points in __RESTRICTED_CHARS above
+
+    It's a little hard to follow, but the upshot is an XML document
+    must contain only characters in Char that are not in
+    RestrictedChar.
+
+    Note that Python's xmlcharrefreplace option is not relevant here -
+    that's about handling characters which can't be encoded in a given
+    charset encoding, not which aren't permitted in XML.
     """
 
-    def maskRestricted(c):
-        if ord(c) in __RESTRICTED_CHARS:
-            return '?'
+    if not isinstance(u, unicode):
+        raise TypeError
+
+    def filterXmlChar(c):
+        if ord(c) > 0x10ffff:
+            return __REPLACEMENT_CHAR  # Outside Unicode range
+        elif unicodedata.category(c) == 'Cs':
+            return __REPLACEMENT_CHAR  # Surrogate pair code point
+        elif (ord(c) == 0xFFFE) or (ord(c) == 0xFFFF):
+            return __REPLACEMENT_CHAR  # Specifically excluded code points
+        elif ord(c) in __RESTRICTED_CHARS:
+            return __REPLACEMENT_CHAR  # Restricted character
         else:
             return c
 
-    return ''.join(maskRestricted(c) for c in u)
+    return ''.join(filterXmlChar(c) for c in u)
 
 
 def _filterObject(obj):
@@ -350,8 +370,14 @@
         return result
 
     def _parseLine(self, line):
-        line = _filterXmlChars(line)
-        args = json.loads(line.decode('utf8'))
+        # Deal with any bad UTF8 encoding from the (untrusted) guest,
+        # by replacing them with the Unicode replacement character
+        uniline = line.decode('utf8', 'replace')
+        args = json.loads(uniline)
+        # Filter out any characters in the untrusted guest response
+        # that aren't permitted in XML.  This must be done _after_ the
+        # JSON decoding, since otherwise JSON's \u escape decoding
+        # could be used to generate the bad characters
         args = _filterObject(args)
         name = args['__name__']
         del args['__name__']


-- 
To view, visit http://gerrit.ovirt.org/16652
To unsubscribe, visit http://gerrit.ovirt.org/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6404815c60c14f1499375609aad95e60d5025f80
Gerrit-PatchSet: 1
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Vinzenz Feenstra <vfeen...@redhat.com>
_______________________________________________
vdsm-patches mailing list
vdsm-patches@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-patches

Change in vdsm[master]: guestIF: Rework XML character filtering

Reply via email to