Jordan,

My main point about the spec is that both need to be updated if this
concept is accepted, not that one has to be done before the other.
Mike

-----Original Message-----
From: Justen, Jordan L
Sent: Tuesday, May 05, 2015 11:52 AM
To: Kinney, Michael D; edk2-devel@lists.sourceforge.net
Cc: Liu, Yingke D
Subject: RE: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files

On 2015-05-05 11:32:57, Kinney, Michael D wrote:
> Jordan,
>
> Here is the link to the EDK II spec that states that UNI files must
> be UTF16-LE
>
> http://sourceforge.net/projects/edk2/files/Specifications/UNI_File_Spec_v1_2_Errata_A.pdf/download

I don't know where the source code is for this document, so I don't
see how I can update it.

I think adding this support to EDK II will still mean that BaseTools
supports the current version of the spec, since it can still handle
UTF-16 files. The last patch, which converts the file in OvmfPkg to
UTF-8, may be invalid by the spec.

Anyway, I think the more natural way to proceed is to add UTF-8 .uni
support to EDK II and then to document the support in the spec. Do you
disagree?

I don't want to waste any more time on this if we can't add it to EDK
II because it is not in the spec. Likewise, I don't see how we can add
it to the spec if it is not supported by EDK II.

> I think there are good reasons to verify that the entire UNI file
> only contains UCS-2 characters. Editors are usually in a single
> Unicode mode, so it would be difficult when using an editor in an
> extended mode to know if only the string contents are UCS-2. Also,
> if we decide to add helper tools to support conversion to UTF-16LE
> to follow the current EDK II UNI specification, for releases or for
> compatibility with other tools, then what would that helper tool do
> with a non-UCS-2 character? It is just easier for the whole file to
> use the same character set.

Ok. That sounds like a good reason.
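[Editor's note: the conversion-helper scenario discussed above can be sketched in Python. This is a hypothetical illustration, not part of BaseTools; the function name `convert_uni_to_utf16le` is invented here. It re-encodes UTF-8 .uni content as UTF-16LE with a BOM, and rejects any character outside the UCS-2 range, which is exactly the case where such a tool would have no valid output.]

```python
import codecs

def convert_uni_to_utf16le(utf8_bytes):
    # Decode as UTF-8, tolerating an optional UTF-8 BOM at the start.
    text = utf8_bytes.decode('utf-8-sig')
    # A spec-conforming UTF-16LE UNI file can only hold UCS-2
    # characters, so reject anything above 0xFFFF rather than
    # silently emitting surrogate pairs.
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError('non-UCS-2 character: %r' % (ch,))
    # Emit the UTF-16LE BOM followed by the encoded text.
    return codecs.BOM_UTF16_LE + text.encode('utf-16-le')
```

[A character such as U+1F600 makes the conversion fail loudly instead of producing a file that a UCS-2-only consumer would mishandle.]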
-Jordan

> -----Original Message-----
> From: Justen, Jordan L
> Sent: Tuesday, May 05, 2015 10:53 AM
> To: Kinney, Michael D; edk2-devel@lists.sourceforge.net
> Cc: Liu, Yingke D
> Subject: RE: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files
>
> On 2015-05-05 09:07:11, Kinney, Michael D wrote:
> > Jordan,
> >
> > If we are going to add support for more UNI file formats, there are
> > also EDK II specifications that must be updated.
>
> I don't know about that. I at least looked under
> BaseTools/UserManuals, and didn't find anything obvious to update.
>
> I'm not sure who owns updating other specs, but they can probably make
> the changes to the spec when new features appear in a code base. (The
> old UTF-16 support should still work as spec'd.) Wouldn't this be more
> of a concern for future UDK releases?
>
> > I am not sure I agree with only checking that the string value has
> > supported Unicode characters. If the Name or Language elements have
> > unsupported Unicode characters, then that will cause problems too. I
> > think I would prefer the entire file, including all comment
> > lines/blocks, to only contain the supported Unicode characters.
>
> If you are referring to comments in the UTF-8 version of the file, why
> constrain them? Obviously there are technical reasons for constraining
> string values, but the same need not apply to comments.
>
> > I like the addition of OpenUniFile() so we have one place to update
>
> ... one place, except apparently UPT has duplicated most of this
> code. :)
>
> -Jordan
>
> > if we decide to add support for more file formats. Please look at
> > the code fragments below that I think can simplify the logic and
> > make it more readable/maintainable by taking advantage of more of
> > the codecs module functions and constants. These code fragments are
> > not based on the current trunk, so there may be some unexpected
> > differences.
> >
> > The codecs module has some constants that can improve the
> > readability of this logic. The following code fragment detects the
> > BOM marker and determines the encoding.
> >
> > #
> > # Read file
> > #
> > try:
> >     FileIn = open (LongFilePath(File.Path), mode='rb').read()
> > except:
> >     EdkLogger.error("build", FILE_OPEN_FAILURE, ExtraData=File)
> >
> > #
> > # Detect Byte Order Mark at beginning of file. Default to UTF-8.
> > #
> > Encoding = 'utf-8'
> > if FileIn.startswith (codecs.BOM_UTF16_BE):
> >     Encoding = 'utf-16be'
> >     FileIn = FileIn[len (codecs.BOM_UTF16_BE):]
> > elif FileIn.startswith (codecs.BOM_UTF16_LE):
> >     Encoding = 'utf-16le'
> >     FileIn = FileIn[len (codecs.BOM_UTF16_LE):]
> > elif FileIn.startswith (codecs.BOM_UTF8):
> >     Encoding = 'utf-8-sig'
> >     FileIn = FileIn[len (codecs.BOM_UTF8):]
> >
> > The following code fragment uses the codecs module and the encoding
> > detected above to verify that all the characters in a UNI file are
> > legal UCS-2 characters. If an invalid character is detected, then
> > additional logic is run in an except clause to determine the line
> > number of the invalid character.
> >
> > #
> > # Convert to unicode
> > #
> > try:
> >     FileIn = codecs.decode (FileIn, Encoding)
> >     Verify = codecs.encode (FileIn, 'utf-16')
> >     Verify = codecs.decode (Verify, 'utf-16')
> > except:
> >     FileIn = codecs.open (LongFilePath(File.Path),
> >                           encoding=Encoding, mode='r')
> >     LineNumber = 0
> >     while True:
> >         LineNumber = LineNumber + 1
> >         try:
> >             Line = FileIn.readline()
> >             if Line == '':
> >                 EdkLogger.error('Unicode File Parser', PARSER_ERROR,
> >                                 '%s contains invalid UCS-2 characters.' % (File.Path))
> >             Line = codecs.encode (Line, 'utf-16')
> >             Line = codecs.decode (Line, 'utf-16')
> >         except:
> >             EdkLogger.error('Unicode File Parser', PARSER_ERROR,
> >                             '%s contains invalid UCS-2 character at line %d.'
> >                             % (File.Path, LineNumber))
> >
> > Best regards,
> >
> > Mike
> >
> > -----Original Message-----
> > From: Justen, Jordan L
> > Sent: Tuesday, May 05, 2015 12:09 AM
> > To: edk2-devel@lists.sourceforge.net
> > Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D
> > Subject: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files
> >
> > Since UEFI only supports UTF-16LE strings internally, this simply
> > allows another unicode encoding for the source file.
> >
> > The strings are still converted to UTF-16LE data for use in EDK II
> > source code.
> >
> > When .uni files contain UTF-16 data, it is impossible for unicode code
> > points to be larger than 0xFFFF. To support .uni files that contain
> > UTF-8 data, we also need to deal with the possibility that the
> > UTF-8 file contains unicode code points larger than 16 bits. Since
> > UEFI only supports 16-bit string data, we make UniClassObject generate
> > an error if a larger code point is seen in a UTF-8 string value.
> >
> > We only check string value data, so it is possible to use larger code
> > points in comments.
> >
> > v2:
> >  * Drop .utf8 extension.
> >    Use .uni file for UTF-8 data (mdkinney)
> >  * Merge in 'BaseTools/UniClassObject: Verify string data is 16-bit'
> >    commit
> >
> > Cc: Yingke D Liu <yingke.d....@intel.com>
> > Cc: Michael D Kinney <michael.d.kin...@intel.com>
> > Contributed-under: TianoCore Contribution Agreement 1.0
> > Signed-off-by: Jordan Justen <jordan.l.jus...@intel.com>
> > ---
> >  BaseTools/Source/Python/AutoGen/UniClassObject.py | 38 +++++++++++++++++++++--
> >  1 file changed, 36 insertions(+), 2 deletions(-)
> >
> > diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > index aa54f4f..41448ab 100644
> > --- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > +++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > @@ -209,7 +209,7 @@ class UniFileClassObject(object):
> >          Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
> >          if len(Lang) != 3:
> >              try:
> > -                FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
> > +                FileIn = self.OpenUniFile(LongFilePath(File.Path))
> >              except UnicodeError, X:
> >                  EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
> >              except:
> > @@ -253,6 +253,38 @@ class UniFileClassObject(object):
> >              self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
> >          return True
> >
> > +    def OpenUniFile(self, FileName):
> > +        Encoding = 'utf-8'
> > +        UniFile = open(FileName, 'rb')
> > +
> > +        #
> > +        # Seek to end of file to determine its size
> > +        #
> > +        UniFile.seek(0, 2)
> > +        FileSize = UniFile.tell()
> > +
> > +        if FileSize >= 2:
> > +            #
> > +            # Seek to start of the file to read the UTF-16 BOM
> > +            #
> > +            UniFile.seek(0, 0)
> > +            Bom = UniFile.read(2)
> > +            UniFile.seek(0, 0)
> > +
> > +            if Bom == '\xff\xfe':
> > +                Encoding = 'utf-16'
> > +
> > +        Info = codecs.lookup(Encoding)
> > +        return codecs.StreamReaderWriter(UniFile, Info.streamreader,
> > +                                         Info.streamwriter)
> > +
> > +    def Verify16bitCodePoints(self, String):
> > +        for cp in String:
> > +            if ord(cp) > 0xffff:
> > +                tmpl = 'The string {} defined in file {} ' + \
> > +                       'contains a character with a code point above 0xFFFF.'
> > +                error = tmpl.format(repr(String), self.File)
> > +                EdkLogger.error('Unicode File Parser', FORMAT_INVALID, error)
> > +
> >      #
> >      # Get String name and value
> >      #
> > @@ -274,6 +306,7 @@ class UniFileClassObject(object):
> >              Language = LanguageList[IndexI].split()[0]
> >              Value = LanguageList[IndexI][LanguageList[IndexI].find(u'\"') + len(u'\"') : LanguageList[IndexI].rfind(u'\"')] #.replace(u'\r\n', u'')
> >              Language = GetLanguageCode(Language, self.IsCompatibleMode, self.File)
> > +            self.Verify16bitCodePoints(Value)
> >              self.AddStringToList(Name, Language, Value)
> >
> >      #
> > @@ -305,7 +338,7 @@ class UniFileClassObject(object):
> >              EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
> >
> >          try:
> > -            FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
> > +            FileIn = self.OpenUniFile(LongFilePath(File.Path))
> >          except UnicodeError, X:
> >              EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
> >          except:
> > @@ -426,6 +459,7 @@ class UniFileClassObject(object):
> >                  MatchString = re.match('[A-Z0-9_]+', Name, re.UNICODE)
> >                  if MatchString == None or MatchString.end(0) != len(Name):
> >                      EdkLogger.error('Unicode File Parser', FORMAT_INVALID, 'The string token name %s defined in UNI file %s contains the invalid lower case character.'
> >                                      %(Name, self.File))
> > +                self.Verify16bitCodePoints(Value)
> >                  self.AddStringToList(Name, Language, Value)
> >                  continue
> >
> > --
> > 2.1.4

------------------------------------------------------------------------------
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/edk2-devel
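[Editor's note: the core logic of the Verify16bitCodePoints() method in the patch can be tried standalone. The sketch below is a hypothetical illustration with an invented name, `find_non_ucs2`; it scans a string for the first code point above 0xFFFF, which is the condition the patch rejects for string values.]

```python
def find_non_ucs2(value):
    # Return the first character whose Unicode code point is above
    # 0xFFFF, i.e. outside the UCS-2 range that UEFI strings can
    # represent; return None if every character fits in 16 bits.
    for ch in value:
        if ord(ch) > 0xFFFF:
            return ch
    return None
```

[One caveat: on a narrow Python 2 build, a non-BMP character is stored as a surrogate pair whose halves each have ord() <= 0xFFFF, so a check like this would instead need to look for surrogate code units in the 0xD800-0xDFFF range.]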