Jordan,

If we are going to add support for more UNI file formats, there are also EDK II specifications that must be updated.
I am not sure I agree with only checking that the string value has supported Unicode characters. If the Name or Language elements have unsupported Unicode characters, then that will cause problems too. I think I would prefer that the entire file, including all comment lines/blocks, contain only the supported Unicode characters.

I like the addition of OpenUniFile() so we have one place to update if we decide to add support for more file formats.

Please look at the code fragments below, which I think can simplify the logic and make it more readable and maintainable by taking advantage of more of the codecs module's functions and constants. These code fragments are not based on the current trunk, so there may be some unexpected differences.

The codecs module has some constants that can improve the readability of this logic. The following code fragment detects the Byte Order Mark (BOM) at the beginning of the file and determines the encoding:

    #
    # Read file
    #
    try:
        FileIn = open (LongFilePath(File.Path), mode='rb').read()
    except:
        EdkLogger.error("build", FILE_OPEN_FAILURE, ExtraData=File)

    #
    # Detect Byte Order Mark at beginning of file.  Default to UTF-8
    #
    Encoding = 'utf-8'
    if FileIn.startswith (codecs.BOM_UTF16_BE):
        Encoding = 'utf-16be'
        FileIn = FileIn.lstrip (codecs.BOM_UTF16_BE)
    elif FileIn.startswith (codecs.BOM_UTF16_LE):
        Encoding = 'utf-16le'
        FileIn = FileIn.lstrip (codecs.BOM_UTF16_LE)
    elif FileIn.startswith (codecs.BOM_UTF8):
        Encoding = 'utf-8-sig'
        FileIn = FileIn.lstrip (codecs.BOM_UTF8)

The following code fragment uses the codecs module and the encoding detected above to verify that all the characters in a UNI file are legal UCS-2 characters. If an invalid character is detected, additional logic runs in the except clause to determine the line number of the invalid character:

    #
    # Convert to unicode
    #
    try:
        FileIn = codecs.decode (FileIn, Encoding)
        Verify = codecs.encode (FileIn, 'utf-16')
        Verify = codecs.decode (Verify, 'utf-16')
    except:
        FileIn = codecs.open (LongFilePath(File.Path), encoding=Encoding, mode='r')
        LineNumber = 0
        while True:
            LineNumber = LineNumber + 1
            try:
                Line = FileIn.readline()
                if Line == '':
                    EdkLogger.error('Unicode File Parser', PARSER_ERROR, '%s contains invalid UCS-2 characters.' % (File.Path))
                Line = codecs.encode (Line, 'utf-16')
                Line = codecs.decode (Line, 'utf-16')
            except:
                EdkLogger.error('Unicode File Parser', PARSER_ERROR, '%s contains invalid UCS-2 character at line %d.' % (File.Path, LineNumber))
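To tie these two fragments together with the 16-bit code point check from your patch, something along the lines of the sketch below may work. It is untested and not based on the current trunk; the OpenUniFileSketch() name is only a placeholder, plain ValueError exceptions stand in for the EdkLogger calls and error constants that would really be used, and the per-character check assumes a Python build whose strings expose full code points:

    import codecs

    def OpenUniFileSketch(FileName):
        # Read the raw bytes so the BOM can be inspected before decoding.
        FileData = open(FileName, 'rb').read()

        # Detect a Byte Order Mark at the start of the file and default to
        # UTF-8 when no BOM is present.  Slicing by the BOM length removes
        # only the BOM bytes themselves.
        Encoding = 'utf-8'
        if FileData.startswith(codecs.BOM_UTF16_BE):
            Encoding = 'utf-16be'
            FileData = FileData[len(codecs.BOM_UTF16_BE):]
        elif FileData.startswith(codecs.BOM_UTF16_LE):
            Encoding = 'utf-16le'
            FileData = FileData[len(codecs.BOM_UTF16_LE):]
        elif FileData.startswith(codecs.BOM_UTF8):
            FileData = FileData[len(codecs.BOM_UTF8):]

        # Decode the whole file with the detected encoding.
        try:
            UniText = codecs.decode(FileData, Encoding)
        except UnicodeDecodeError as Details:
            raise ValueError('%s could not be decoded as %s: %s' % (FileName, Encoding, Details))

        # Check every character in the file, comments included, and report
        # the line number of the first code point that does not fit in 16 bits.
        for LineNumber, Line in enumerate(UniText.splitlines(), 1):
            for Char in Line:
                if ord(Char) > 0xFFFF:
                    raise ValueError('%s contains an invalid UCS-2 character at line %d.' % (FileName, LineNumber))

        return UniText

The main point is that the BOM detection, the decode, and the whole-file UCS-2 check can all live behind the single OpenUniFile() entry point, so there is still only one place to update if we add support for more file formats.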
Best regards,

Mike

-----Original Message-----
From: Justen, Jordan L
Sent: Tuesday, May 05, 2015 12:09 AM
To: edk2-devel@lists.sourceforge.net
Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D
Subject: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files

Since UEFI only supports UTF-16LE strings internally, this simply allows for another unicode source file encoding. The strings are still converted to UTF-16LE data for use in EDK II source code.

When .uni files contain UTF-16 data, it is impossible for unicode code points to be larger than 0xFFFF. To support .uni files that contain UTF-8 data, we also need to deal with the possibility that the UTF-8 file contains unicode code points larger than 16 bits.

Since UEFI only supports 16-bit string data, we make UniClassObject generate an error if a larger code point is seen in a UTF-8 string value. We only check string value data, so it is possible to use larger code points in comments.

v2:
 * Drop .utf8 extension. Use .uni file for UTF-8 data (mdkinney)
 * Merge in 'BaseTools/UniClassObject: Verify string data is 16-bit' commit

Cc: Yingke D Liu <yingke.d....@intel.com>
Cc: Michael D Kinney <michael.d.kin...@intel.com>
Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <jordan.l.jus...@intel.com>
---
 BaseTools/Source/Python/AutoGen/UniClassObject.py | 38 +++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index aa54f4f..41448ab 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -209,7 +209,7 @@ class UniFileClassObject(object):
         Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
         if len(Lang) != 3:
             try:
-                FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
+                FileIn = self.OpenUniFile(LongFilePath(File.Path))
             except UnicodeError, X:
                 EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
             except:
@@ -253,6 +253,38 @@ class UniFileClassObject(object):
             self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
         return True
 
+    def OpenUniFile(self, FileName):
+        Encoding = 'utf-8'
+        UniFile = open(FileName, 'rb')
+
+        #
+        # Seek to end of file to determine its size
+        #
+        UniFile.seek(0, 2)
+        FileSize = UniFile.tell()
+
+        if FileSize >= 2:
+            #
+            # Seek to start of the file to read the UTF-16 BOM
+            #
+            UniFile.seek(0, 0)
+            Bom = UniFile.read(2)
+            UniFile.seek(0, 0)
+
+            if Bom == '\xff\xfe':
+                Encoding = 'utf-16'
+
+        Info = codecs.lookup(Encoding)
+        return codecs.StreamReaderWriter(UniFile, Info.streamreader, Info.streamwriter)
+
+    def Verify16bitCodePoints(self, String):
+        for cp in String:
+            if ord(cp) > 0xffff:
+                tmpl = 'The string {} defined in file {} ' + \
+                    'contains a character with a code point above 0xFFFF.'
+                error = tmpl.format(repr(String), self.File)
+                EdkLogger.error('Unicode File Parser', FORMAT_INVALID, error)
+
     #
     # Get String name and value
     #
@@ -274,6 +306,7 @@ class UniFileClassObject(object):
             Language = LanguageList[IndexI].split()[0]
             Value = LanguageList[IndexI][LanguageList[IndexI].find(u'\"') + len(u'\"') : LanguageList[IndexI].rfind(u'\"')] #.replace(u'\r\n', u'')
             Language = GetLanguageCode(Language, self.IsCompatibleMode, self.File)
+            self.Verify16bitCodePoints(Value)
             self.AddStringToList(Name, Language, Value)
     #
     #
@@ -305,7 +338,7 @@ class UniFileClassObject(object):
             EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
         try:
-            FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
+            FileIn = self.OpenUniFile(LongFilePath(File.Path))
         except UnicodeError, X:
             EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
         except:
@@ -426,6 +459,7 @@ class UniFileClassObject(object):
                 MatchString = re.match('[A-Z0-9_]+', Name, re.UNICODE)
                 if MatchString == None or MatchString.end(0) != len(Name):
                     EdkLogger.error('Unicode File Parser', FORMAT_INVALID, 'The string token name %s defined in UNI file %s contains the invalid lower case character.'
                                     %(Name, self.File))
+                    self.Verify16bitCodePoints(Value)
                     self.AddStringToList(Name, Language, Value)
                     continue
-- 
2.1.4
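For completeness, here is a minimal standalone illustration of the kind of check Verify16bitCodePoints() performs in this patch. The Example.uni file name, the sample strings, and the plain ValueError are hypothetical, and the example assumes a Python build whose strings expose full code points:

    # Standalone sketch of the 16-bit code point check; names are illustrative only.
    def Verify16bitCodePoints(FileName, String):
        # UEFI strings are arrays of 16-bit CHAR16 values, so reject any
        # character whose code point needs more than 16 bits.
        for cp in String:
            if ord(cp) > 0xffff:
                raise ValueError('The string %r defined in file %s contains a '
                                 'character with a code point above 0xFFFF.'
                                 % (String, FileName))

    # U+00C9 (LATIN CAPITAL LETTER E WITH ACUTE) fits in UCS-2, so this passes.
    Verify16bitCodePoints('Example.uni', u'\u00C9')

    # U+1F600 is above 0xFFFF, so this is rejected.
    try:
        Verify16bitCodePoints('Example.uni', u'\U0001F600')
    except ValueError as Details:
        print(Details)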