Dan Kenigsberg has submitted this change and it was merged. Change subject: guestIF: Rework XML character filtering ......................................................................
guestIF: Rework XML character filtering Data from the guest agent is essentially untrusted, and may contain characters which cannot be included in XML such as vdsm's XML-RPC responses. Commits b5db7c60b61ef6143d3be27c58658ac94c858626 (guestIF: Additional object strings filtering) and aef1a7d288d0e5f82ebc4fea3e5f0155e9a7f541 (agent: XML Character filtering improvement) are attempts to correct this filtering, but they are conceptually incorrect and can both fail to filter characters which need to be filtered, and filter characters which do not need to be filtered. aef1a7d288d0e5f82ebc4fea3e5f0155e9a7f541 added code to run the XML filtering before UTF8 decoding the message from the guest. This is incorrect, since the characters permitted in XML files are defined in terms of Unicode code points, not individual byte values. This means that we currently filter out perfectly valid characters such as \u2122 (trademark symbol) because its UTF8 encoding (\xe2\x84\xa2) contains a byte (0x84) with the same value as the ordinal of one of the XML restricted characters. Worse, filtering that byte gives a bytestring which is no longer valid UTF8, and so will lead to a UnicodeError exception on decode. Since that filtering also takes place before UTF8 and JSON decoding, those steps may still result in the construction of invalid XML characters. b5db7c60b61ef6143d3be27c58658ac94c858626 partially corrects this by (re-)adding XML filtering after the UTF8 and JSON decode as well. This is conceptually correct, but the filter is not thorough enough. Currently the code only filters out characters in set XML defines as "RestrictedChar". In XML valid characters are those in the set "Char" which are not in "RestrictedChar", and the filtering currently leaves in characters which are neither in "Char" nor "RestrictedChar" - this includes the surrogate pair code points, \ufffe and \uffff. \uffff at least will cause the XML parser in ovirt to choke. This patch addresses all the above problems. Additionally it now replaces bad characters with the official Unicode replacement character '\ufffd', instead of '?'. It also adds the 'replace' error handling on the UTF8 decode, so that invalidly encoded UTF8 strings from the guest will not cause an exception in VDSM. Change-Id: I6404815c60c14f1499375609aad95e60d5025f80 Signed-off-by: David Gibson <da...@gibson.dropbear.id.au> Reviewed-on: http://gerrit.ovirt.org/16652 Reviewed-by: Vinzenz Feenstra <vfeen...@redhat.com> Tested-by: Vinzenz Feenstra <vfeen...@redhat.com> Reviewed-by: Martin Sivák <msi...@redhat.com> Reviewed-by: Dan Kenigsberg <dan...@redhat.com> --- M tests/guestIFTests.py M vdsm/guestIF.py 2 files changed, 53 insertions(+), 35 deletions(-) Approvals: Martin Sivák: Looks good to me, but someone else must approve Vinzenz Feenstra: Verified; Looks good to me, but someone else must approve Dan Kenigsberg: Looks good to me, approved -- To view, visit http://gerrit.ovirt.org/16652 To unsubscribe, visit http://gerrit.ovirt.org/settings Gerrit-MessageType: merged Gerrit-Change-Id: I6404815c60c14f1499375609aad95e60d5025f80 Gerrit-PatchSet: 2 Gerrit-Project: vdsm Gerrit-Branch: master Gerrit-Owner: Vinzenz Feenstra <vfeen...@redhat.com> Gerrit-Reviewer: Arik Hadas <aha...@redhat.com> Gerrit-Reviewer: Dan Kenigsberg <dan...@redhat.com> Gerrit-Reviewer: Martin Sivák <msi...@redhat.com> Gerrit-Reviewer: Michal Skrivanek <michal.skriva...@redhat.com> Gerrit-Reviewer: Vinzenz Feenstra <vfeen...@redhat.com> Gerrit-Reviewer: oVirt Jenkins CI Server _______________________________________________ vdsm-patches mailing list vdsm-patches@lists.fedorahosted.org https://lists.fedorahosted.org/mailman/listinfo/vdsm-patches