Jordan,

My main point about the spec is that both need to be updated if this
concept is accepted, not that one has to be done before the other.
Mike

-----Original Message-----
From: Justen, Jordan L
Sent: Tuesday, May 05, 2015 11:52 AM
To: Kinney, Michael D; edk2-devel@lists.sourceforge.net
Cc: Liu, Yingke D
Subject: RE: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files

On 2015-05-05 11:32:57, Kinney, Michael D wrote:
> Jordan,
>
> Here is the link to the EDK II spec that states that UNI files must
> be UTF16-LE
>
> http://sourceforge.net/projects/edk2/files/Specifications/UNI_File_Spec_v1_2_Errata_A.pdf/download

I don't know where the source code is for this document, so I don't
see how I can update it.

I think adding this support to EDK II will still mean that BaseTools
supports the current version of the spec, since it can still handle
UTF-16 files. The last patch, which converts the file in OvmfPkg to
UTF-8, may be invalid by the spec.

Anyway, I think the more natural way to proceed is to add UTF-8 .uni
support to EDK II and then to document the support in the spec. Do you
disagree?

I don't want to waste any more time on this if we can't add it to EDK
II because it is not in the spec. Likewise, I don't see how we can add
it to the spec if it is not supported by EDK II.

> I think there are good reasons to verify that the entire UNI file
> only contains UCS-2 characters. Editors are usually in a single
> Unicode mode, so it would be difficult when using an editor in an
> extended mode to know if only the string contents are UCS-2. Also,
> if we decide to add helper tools to support conversion to UTF-16LE
> to follow the current EDK II UNI specification, for releases or for
> compatibility with other tools, then what would that helper tool do
> with a non-UCS-2 character? It is just easier for the whole file to
> use the same character set.

Ok. That sounds like a good reason.
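[Editor's note: the conversion-helper scenario discussed above can be sketched in Python. This is a hypothetical illustration, not part of BaseTools; the function name `convert_uni_to_utf16le` is invented here. It re-encodes UTF-8 .uni content as UTF-16LE with a BOM, and rejects any character outside the UCS-2 range, which is exactly the case where such a tool would have no valid output.]

```python
import codecs

def convert_uni_to_utf16le(utf8_bytes):
    # Decode as UTF-8, tolerating an optional UTF-8 BOM at the start.
    text = utf8_bytes.decode('utf-8-sig')
    # A spec-conforming UTF-16LE UNI file can only hold UCS-2
    # characters, so reject anything above 0xFFFF rather than
    # silently emitting surrogate pairs.
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError('non-UCS-2 character: %r' % (ch,))
    # Emit the UTF-16LE BOM followed by the encoded text.
    return codecs.BOM_UTF16_LE + text.encode('utf-16-le')
```

[A character such as U+1F600 makes the conversion fail loudly instead of producing a file that a UCS-2-only consumer would mishandle.]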
-Jordan

> -----Original Message-----
> From: Justen, Jordan L
> Sent: Tuesday, May 05, 2015 10:53 AM
> To: Kinney, Michael D; edk2-devel@lists.sourceforge.net
> Cc: Liu, Yingke D
> Subject: RE: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files
>
> On 2015-05-05 09:07:11, Kinney, Michael D wrote:
> > Jordan,
> >
> > If we are going to add support for more UNI file formats, there are
> > also EDK II specifications that must be updated.
>
> I don't know about that. I at least looked under
> BaseTools/UserManuals, and didn't find anything obvious to update.
>
> I'm not sure who owns updating other specs, but they can probably make
> the changes to the spec when new features appear in a code base. (The
> old UTF-16 support should still work as spec'd.) Wouldn't this be more
> of a concern for future UDK releases?
>
> > I am not sure I agree with only checking that the string value has
> > supported Unicode characters. If the Name or Language elements have
> > unsupported Unicode characters, then that will cause problems too. I
> > think I would prefer the entire file, including all comment
> > lines/blocks, to only contain the supported Unicode characters.
>
> If you are referring to comments in the UTF-8 version of the file, why
> constrain them? Obviously there are technical reasons for constraining
> string values, but the same need not apply to comments.
>
> > I like the addition of OpenUniFile() so we have one place to update
>
> ... one place, except apparently UPT has duplicated most of this
> code. :)
>
> -Jordan
>
> > if we decide to add support for more file formats. Please look at
> > the code fragments below that I think can simplify the logic and
> > make it more readable/maintainable by taking advantage of more of
> > the codecs module functions and constants. These code fragments are
> > not based on the current trunk, so there may be some unexpected
> > differences.
> >
> > The codecs module has some constants that can improve the
> > readability of this logic. The following code fragment detects the
> > BOM marker and determines the encoding.
> >
> > #
> > # Read file
> > #
> > try:
> >     FileIn = open (LongFilePath(File.Path), mode='rb').read()
> > except:
> >     EdkLogger.error("build", FILE_OPEN_FAILURE, ExtraData=File)
> >
> > #
> > # Detect Byte Order Mark at beginning of file. Default to UTF-8.
> > #
> > Encoding = 'utf-8'
> > if FileIn.startswith (codecs.BOM_UTF16_BE):
> >     Encoding = 'utf-16be'
> >     FileIn = FileIn[len (codecs.BOM_UTF16_BE):]
> > elif FileIn.startswith (codecs.BOM_UTF16_LE):
> >     Encoding = 'utf-16le'
> >     FileIn = FileIn[len (codecs.BOM_UTF16_LE):]
> > elif FileIn.startswith (codecs.BOM_UTF8):
> >     Encoding = 'utf-8-sig'
> >     FileIn = FileIn[len (codecs.BOM_UTF8):]
> >
> > The following code fragment uses the codecs module and the encoding
> > detected above to verify that all the characters in a UNI file are
> > legal UCS-2 characters. If an invalid character is detected, then
> > additional logic is run in an except clause to determine the line
> > number of the invalid character.
> >
> > #
> > # Convert to unicode
> > #
> > try:
> >     FileIn = codecs.decode (FileIn, Encoding)
> >     Verify = codecs.encode (FileIn, 'utf-16')
> >     Verify = codecs.decode (Verify, 'utf-16')
> > except:
> >     FileIn = codecs.open (LongFilePath(File.Path),
> >                           encoding=Encoding, mode='r')
> >     LineNumber = 0
> >     while True:
> >         LineNumber = LineNumber + 1
> >         try:
> >             Line = FileIn.readline()
> >             if Line == '':
> >                 EdkLogger.error('Unicode File Parser', PARSER_ERROR,
> >                                 '%s contains invalid UCS-2 characters.' % (File.Path))
> >             Line = codecs.encode (Line, 'utf-16')
> >             Line = codecs.decode (Line, 'utf-16')
> >         except:
> >             EdkLogger.error('Unicode File Parser', PARSER_ERROR,
> >                             '%s contains invalid UCS-2 character at line %d.'
> >                             % (File.Path, LineNumber))
> >
> > Best regards,
> >
> > Mike
> >
> > -----Original Message-----
> > From: Justen, Jordan L
> > Sent: Tuesday, May 05, 2015 12:09 AM
> > To: edk2-devel@lists.sourceforge.net
> > Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D
> > Subject: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files
> >
> > Since UEFI only supports UTF-16LE strings internally, this simply
> > allows another unicode encoding for the source file.
> >
> > The strings are still converted to UTF-16LE data for use in EDK II
> > source code.
> >
> > When .uni files contain UTF-16 data, it is impossible for unicode code
> > points to be larger than 0xFFFF. To support .uni files that contain
> > UTF-8 data, we also need to deal with the possibility that the
> > UTF-8 file contains unicode code points larger than 16 bits. Since
> > UEFI only supports 16-bit string data, we make UniClassObject generate
> > an error if a larger code point is seen in a UTF-8 string value.
> >
> > We only check string value data, so it is possible to use larger code
> > points in comments.
> >
> > v2:
> >  * Drop .utf8 extension.
> >    Use .uni file for UTF-8 data (mdkinney)
> >  * Merge in 'BaseTools/UniClassObject: Verify string data is 16-bit'
> >    commit
> >
> > Cc: Yingke D Liu <yingke.d....@intel.com>
> > Cc: Michael D Kinney <michael.d.kin...@intel.com>
> > Contributed-under: TianoCore Contribution Agreement 1.0
> > Signed-off-by: Jordan Justen <jordan.l.jus...@intel.com>
> > ---
> >  BaseTools/Source/Python/AutoGen/UniClassObject.py | 38 +++++++++++++++++++++--
> >  1 file changed, 36 insertions(+), 2 deletions(-)
> >
> > diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > index aa54f4f..41448ab 100644
> > --- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > +++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > @@ -209,7 +209,7 @@ class UniFileClassObject(object):
> >          Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
> >          if len(Lang) != 3:
> >              try:
> > -                FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
> > +                FileIn = self.OpenUniFile(LongFilePath(File.Path))
> >              except UnicodeError, X:
> >                  EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
> >              except:
> > @@ -253,6 +253,38 @@ class UniFileClassObject(object):
> >              self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
> >          return True
> >
> > +    def OpenUniFile(self, FileName):
> > +        Encoding = 'utf-8'
> > +        UniFile = open(FileName, 'rb')
> > +
> > +        #
> > +        # Seek to end of file to determine its size
> > +        #
> > +        UniFile.seek(0, 2)
> > +        FileSize = UniFile.tell()
> > +
> > +        if FileSize >= 2:
> > +            #
> > +            # Seek to start of the file to read the UTF-16 BOM
> > +            #
> > +            UniFile.seek(0, 0)
> > +            Bom = UniFile.read(2)
> > +            UniFile.seek(0, 0)
> > +
> > +            if Bom == '\xff\xfe':
> > +                Encoding = 'utf-16'
> > +
> > +        Info = codecs.lookup(Encoding)
> > +        return codecs.StreamReaderWriter(UniFile, Info.streamreader,
> > +                                         Info.streamwriter)
> > +
> > +    def Verify16bitCodePoints(self, String):
> > +        for cp in String:
> > +            if ord(cp) > 0xffff:
> > +                tmpl = 'The string {} defined in file {} ' + \
> > +                       'contains a character with a code point above 0xFFFF.'
> > +                error = tmpl.format(repr(String), self.File)
> > +                EdkLogger.error('Unicode File Parser', FORMAT_INVALID, error)
> > +
> >      #
> >      # Get String name and value
> >      #
> > @@ -274,6 +306,7 @@ class UniFileClassObject(object):
> >              Language = LanguageList[IndexI].split()[0]
> >              Value = LanguageList[IndexI][LanguageList[IndexI].find(u'\"') + len(u'\"') : LanguageList[IndexI].rfind(u'\"')] #.replace(u'\r\n', u'')
> >              Language = GetLanguageCode(Language, self.IsCompatibleMode, self.File)
> > +            self.Verify16bitCodePoints(Value)
> >              self.AddStringToList(Name, Language, Value)
> >
> >      #
> > @@ -305,7 +338,7 @@ class UniFileClassObject(object):
> >              EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
> >
> >          try:
> > -            FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
> > +            FileIn = self.OpenUniFile(LongFilePath(File.Path))
> >          except UnicodeError, X:
> >              EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
> >          except:
> > @@ -426,6 +459,7 @@ class UniFileClassObject(object):
> >                  MatchString = re.match('[A-Z0-9_]+', Name, re.UNICODE)
> >                  if MatchString == None or MatchString.end(0) != len(Name):
> >                      EdkLogger.error('Unicode File Parser', FORMAT_INVALID, 'The string token name %s defined in UNI file %s contains the invalid lower case character.'
> >                                      %(Name, self.File))
> > +                self.Verify16bitCodePoints(Value)
> >                  self.AddStringToList(Name, Language, Value)
> >                  continue
> >
> > --
> > 2.1.4

------------------------------------------------------------------------------
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/edk2-devel
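[Editor's note: the core logic of the Verify16bitCodePoints() method in the patch can be tried standalone. The sketch below is a hypothetical illustration with an invented name, `find_non_ucs2`; it scans a string for the first code point above 0xFFFF, which is the condition the patch rejects for string values.]

```python
def find_non_ucs2(value):
    # Return the first character whose Unicode code point is above
    # 0xFFFF, i.e. outside the UCS-2 range that UEFI strings can
    # represent; return None if every character fits in 16 bits.
    for ch in value:
        if ord(ch) > 0xFFFF:
            return ch
    return None
```

[One caveat: on a narrow Python 2 build, a non-BMP character is stored as a surrogate pair whose halves each have ord() <= 0xFFFF, so a check like this would instead need to look for surrogate code units in the 0xD800-0xDFFF range.]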