Roman Akopov <ado...@gmail.com> added the comment:

This is how I extract the data from Common Locale Data Repository v37. The script assumes common\main as the working directory.

    from os import walk
    from xml.etree import ElementTree

    en_root = ElementTree.parse('en.xml')
    for (dirpath, dirnames, filenames) in walk('.'):
        for filename in filenames:
            if filename.endswith('.xml'):
                # The locale code is the file name without the '.xml' suffix.
                code = filename[:-4]
                xx_root = ElementTree.parse(filename)
                xx_lang = xx_root.find('localeDisplayNames/languages/language[@type=\'' + code + '\']')
                en_lang = en_root.find('localeDisplayNames/languages/language[@type=\'' + code + '\']')
                # find() returns None when there is no entry for this code.
                if en_lang is not None and xx_lang is not None and en_lang.text == 'Cherokee':
                    print(en_lang.text)
                    print(xx_lang.text)
                    print(xx_lang.text.encode("unicode_escape"))
                    print(xx_lang.text.encode('idna'))
                    print(ord(xx_lang.text[0]))
                    print(ord(xx_lang.text[1]))
                    print(ord(xx_lang.text[2]))

The script outputs

    Cherokee
    ᏣᎳᎩ
    b'\\u13e3\\u13b3\\u13a9'
    b'xn--tz9ata7l'
    5091
    5043
    5033

If I change the text to lower case

    print(en_lang.text.lower())
    print(xx_lang.text.lower())
    print(xx_lang.text.lower().encode("unicode_escape"))
    print(xx_lang.text.lower().encode('idna'))
    print(ord(xx_lang.text.lower()[0]))
    print(ord(xx_lang.text.lower()[1]))
    print(ord(xx_lang.text.lower()[2]))

then the script outputs

    cherokee
    ꮳꮃꭹ
    b'\\uabb3\\uab83\\uab79'
    b'xn--tz9ata7l'
    43955
    43907
    43897

I am not sure where you get the '\u13e3\u13b3\u13a9' string from.

    '\u13e3\u13b3\u13a9'.lower().encode('unicode_escape')

gives

    b'\\uabb3\\uab83\\uab79'

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40845>
_______________________________________
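
The lowercasing behaviour reported in the comment can also be checked without a CLDR checkout. The following is only a minimal sketch: the code points are copied from the comment, the names `upper`, `lower` and `describe` are illustrative, and it assumes an interpreter whose Unicode database already includes the lowercase Cherokee letters (Unicode 8.0 or later).

```python
import unicodedata

# Code points copied from the comment; variable names are illustrative only.
upper = "\u13e3\u13b3\u13a9"   # the Cherokee name as printed by the script
lower = upper.lower()          # relies on the interpreter's Unicode database


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


print(describe(upper))
print(describe(lower))
print(lower.encode("unicode_escape"))
```

Printing the character names next to the code points makes it easier to see which of the two Cherokee blocks a given string comes from.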