Re: [gwt-contrib] Re: Unicode support for Character.is* methods
I don't have access to that patch but if it was correct; I'm sure that it was very costly though. A recent attempt looked like following: private static class RegExps {/* Formatted strings for the fallback regular expressions were generated with the following * Python 3 script.from unicodedata import bidirectional, category from itertools import *condition = lambda x: category(chr(x)).startswith("L") # adjust thisuchr = lambda x: "u%04X" % x if bidirectional(chr(x)) in ("R", "AN", "AL") else chr(x)ranges = []codepoints = [x for x in range(0, 0x) if condition(x)]for _, group in groupby(enumerate(codepoints), lambda t: t[0] - t[1]):l = list(item for (index, item) in group) ranges += [uchr(l[0]) + ("-" if len(l) > 2 else "") + (uchr(l[-1]) if len(l) > 1 else "")]formatted = "\"["for r in ranges:if len(formatted) + len(r) > 99: print(formatted + "\"")formatted = "+ \"" formatted += rprint(formatted + "]\";") */private static final String LETTER_FALLBACK = "[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙա-և\\u05D0-\\u05EA\\u05F0-\\u05F2" + "\\u0620-\\u064A\\u066E\\u066F\\u0671-\\u06D3\\u06D5\\u06E5\\u06E6\\u06EE\\u06EF" + "\\u06FA-\\u06FC\\u06FF\\u0710\\u0712-\\u072F\\u074D-\\u07A5\\u07B1\\u07CA-\\u07EA" + "\\u07F4\\u07F5\\u07FA\\u0800-\\u0815\\u081A\\u0824\\u0828\\u0840-\\u0858" + "\\u08A0-\\u08B4\\u08B6-\\u08BDऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএঐও-নপ-রলশ-হঽৎড়ঢ়য়-ৡৰৱਅ-ਊਏਐਓ-ਨਪ-ਰਲਲ਼ਵਸ਼ਸਹ" + "ਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલળવ-હઽૐૠૡૹଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହଽଡ଼ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹௐఅ-ఌఎ-ఐ" + "ఒ-నప-హఽౘ-ౚౠౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠೡೱೲഅ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะาำเ-ๆກຂຄງຈຊຍ" + "ດ-ທນ-ຟມ-ຣລວສຫອ-ະາຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍ" + "ነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡷᢀ-ᢄᢇ-ᢨᢪ" + "ᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᳩ-ᳬᳮ-ᳱᳵᳶᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώ" + "ᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯ" + "ⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々〆〱-〵〻〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄭㄱ-ㆎㆠ-ㆺㇰ-ㇿ㐀-䶵一-鿕ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪꘫꙀ-ꙮ" + "ꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞮꞰ-ꞷꟷ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵꪶ" + "ꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭥꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗ\\uFB1D\\uFB1F-\\uFB28" + "\\uFB2A-\\uFB36\\uFB38-\\uFB3C\\uFB3E\\uFB40\\uFB41\\uFB43\\uFB44\\uFB46-\\uFBB1" + "\\uFBD3-\\uFD3D\\uFD50-\\uFD8F\\uFD92-\\uFDC7\\uFDF0-\\uFDFB\\uFE70-\\uFE74" + "\\uFE76-\\uFEFCA-Za-zヲ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ]";private static final String DIGIT_FALLBACK = "[0-9\\u0660-\\u0669۰-۹\\u07C0-\\u07C9०-९০-৯੦-੯૦-૯୦-୯௦-௯౦-౯೦-೯൦-൯෦-෯๐-๙໐-໙༠-༩၀-၉႐-႙០-៩᠐-᠙" + "᥆-᥏᧐-᧙᪀-᪉᪐-᪙᭐-᭙᮰-᮹᱀-᱉᱐-᱙꘠-꘩꣐-꣙꤀-꤉꧐-꧙꧰-꧹꩐-꩙꯰-꯹0-9]"; private static final String LOWER_CASE_FALLBACK = "[a-zµß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıijĵķĸĺļľŀłńņňʼnŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž-ƀƃƅƈƌƍƒƕƙ-ƛƞơƣ" + "ƥƨƪƫƭưƴƶƹƺƽ-ƿdžljnjǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯǰdzǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿɀɂɇɉɋɍɏ-ʓʕ-ʯͱ" + "ͳͷͻ-ͽΐά-ώϐϑϕ-ϗϙϛϝϟϡϣϥϧϩϫϭϯ-ϳϵϸϻϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊ" + "ӌӎӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣԥԧԩԫԭԯա-ևᏸ-ᏽᲀ-ᲈᴀ-ᴫᵫ-ᵷᵹ-ᶚḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝ" + "ḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ-ẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉ" + "ịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰ-ώᾀ-ᾇᾐ-ᾗᾠ-ᾧᾰ-ᾴᾶᾷιῂ-ῄῆῇῐ-ΐῖῗῠ-ῧῲ-ῴῶῷℊ" + "ℎℏℓℯℴℹℼℽⅆ-ⅉⅎↄⰰ-ⱞⱡⱥⱦⱨⱪⱬⱱⱳⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⳤⳬⳮⳳ" + "ⴀ-ⴥⴧⴭꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙡꙣꙥꙧꙩꙫꙭꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꚙꚛꜣꜥꜧꜩꜫꜭꜯ-ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯ" + "ꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞌꞎꞑꞓ-ꞕꞗꞙꞛꞝꞟꞡꞣꞥꞧꞩꞵꞷꟺꬰ-ꭚꭠ-ꭥꭰ-ꮿff-stﬓ-ﬗa-z]"; private static final String UPPER_CASE_FALLBACK = "[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƉ-ƋƎ-ƑƓƔƖ-Ƙ" + "ƜƝƟƠƢƤƦƧƩƬƮƯƱ-ƳƵƷƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺȻȽȾɁɃ-ɆɈɊɌ" + "ɎͰͲͶͿΆΈ-ΊΌΎΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼ" + "ҾӀӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱ-ՖႠ-ჅჇჍᎠ-ᏵḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞ" + "ḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎ" + "ỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ΆῈ-ΉῘ-ΊῨ-ῬῸ-Ώℂℇℋ-ℍℐ-ℒℕℙ-ℝℤΩℨK-ℭℰ-ℳℾℿ" + "ⅅↃⰀ-ⰮⱠⱢ-ⱤⱧⱩⱫⱭ-ⱰⱲⱵⱾ-ⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖ" + "ꙘꙚꙜꙞꙠꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꚘꚚꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽꝾꞀꞂꞄꞆꞋꞍꞐꞒꞖꞘꞚꞜꞞꞠꞢꞤꞦ" + "ꞨꞪ-ꞮꞰ-ꞴꞶA-Z]";private static final String TITLE_CASE_FALLBACK = "[DžLjNjDzᾈ-ᾏᾘ-ᾟᾨ-ᾯᾼῌῼ]";// The expressions below were generated by looping over all valid Unicode code points using the// desktop version of Java.private static final NativeRegExp WHITESPACE =new NativeRegExp( "[\\u0009-\\u000D\\u001C-\\u002
Re: [gwt-contrib] Re: Unicode support for Character.is* methods
This was never merged in the end, Are the patch sets still available? On Thursday, 1 April 2010 16:11:49 UTC+1, John Tamplin wrote: > > On Wed, Mar 31, 2010 at 5:43 PM, Pascal Muetschard > wrote: > >> I have uploaded another patch set to >> http://gwt-code-reviews.appspot.com/226801 >> to address the concerns raised. See inline messages below. >> > > Thanks for your efforts -- it will be next week before I can look closely > at it. > > -- > John A. Tamplin > Software Engineer (GWT), Google > -- You received this message because you are subscribed to the Google Groups "GWT Contributors" group. To unsubscribe from this group and stop receiving emails from it, send an email to google-web-toolkit-contributors+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/google-web-toolkit-contributors/01deac40-1eff-41c0-bf1a-9bd64e90ab7e%40googlegroups.com.
Re: [gwt-contrib] Re: Unicode support for Character.is* methods
On Wed, Mar 31, 2010 at 5:43 PM, Pascal Muetschard < pmuetschard...@google.com > wrote: > I have uploaded another patch set to > http://gwt-code-reviews.appspot.com/226801 > to address the concerns raised. See inline messages below. > Thanks for your efforts -- it will be next week before I can look closely at it. -- John A. Tamplin Software Engineer (GWT), Google -- http://groups.google.com/group/Google-Web-Toolkit-Contributors To unsubscribe, reply using "remove me" as the subject.
[gwt-contrib] Re: Unicode support for Character.is* methods
I have uploaded another patch set to http://gwt-code-reviews.appspot.com/226801 to address the concerns raised. See inline messages below. This latest version has an ASCII only option for the is*() methods witch has an overhead of a couple hundred bytes. See below for the size penalties for the tables: is*() methods: 5396 (all of them, not each) getDirectionality(): 2627 getType(): 4112 getNumericValue: 2163 digit(char,int): 6779 (uses isDigit() and getNumericValue()) TOTAL: 11681 (at savings of 2617 - the shared code between each is about 700 bytes) I feel like the size argument is well met - if using all the tables, the penalty is a mere 11k - comparing with the bootstrap code usually at 5k and HashMap at 15k, that's quite small. On Mar 16, 7:51 pm, j...@google.com wrote: > A few issues: > > - the way this is divided, all of the code will get pulled into every > app that calls any of these methods. I had thought about separating the tables out for each of the is*() methods, however, the extra code each of the objects adds quickly becomes much larger than the data of the tables. This means that we would sacrifice the runtime size of the common case for the corner case where only a single is*() method is used. Also, the inherit relationship (i.e. if isDefined is false, all others are false as well) and mutual exclusion (isUpperCase vs isLowerCase vs isDigit) between the attributes cut out a lot of duplication by combing the tables. I have also added the deferred property and your "ASCII version." However, I've made unicode the default, as I feel like that's what people expect from GWT - to be i18n compatible by default. > - this is incomplete and doesn't have the other properties, such as > getDirection, toLower, etc. I've added getDirectionality(), getType(), getNumericValue() and digit(char,int). I have excluded the to*Case() methods on purpose, as their definition is not i18n correct - there are upper case characters that need more than one character in lower case and vice versa. > > I had written a full implementation a while ago (it is still available > in svn at changes/jat/ucd), which encoded each table separately with a > combination of run-length encoding and huffman coding the runs, which > got the size of individual tables down to a few hundred bytes each, and > you only paid for the tables that were used. The decompression code was > of course larger, so maybe there is room for a simpler encoding > mechanism that takes less code even if the data is larger. I have looked at this and mine is similar. My version also encodes run lengths and makes sure that the most common "tokens" have the smallest representation. It also uses LZW to compress the data. This compression is simple and needs a lot less code to decompress, but still provides a good compression ratio. > > That effort was complete but was never merged in because some people > objected to the code size increase. Given the synchronous nature of the > API, it isn't feasible to fetch the tables on-demand from a server, so > they have to be downloaded with the code (they can go into different > runAsync fragments though). > > I hope to work on that and other i18n issues next quarter, but I am not > sure how much time I will have to work on it. Which is why I'm trying to help with this :) > > http://gwt-code-reviews.appspot.com/226801 -- http://groups.google.com/group/Google-Web-Toolkit-Contributors To unsubscribe, reply using "remove me" as the subject.
[gwt-contrib] Re: Unicode support for Character.is* methods
A few issues: - the way this is divided, all of the code will get pulled into every app that calls any of these methods. - this is incomplete and doesn't have the other properties, such as getDirection, toLower, etc. I had written a full implementation a while ago (it is still available in svn at changes/jat/ucd), which encoded each table separately with a combination of run-length encoding and huffman coding the runs, which got the size of individual tables down to a few hundred bytes each, and you only paid for the tables that were used. The decompression code was of course larger, so maybe there is room for a simpler encoding mechanism that takes less code even if the data is larger. That effort was complete but was never merged in because some people objected to the code size increase. Given the synchronous nature of the API, it isn't feasible to fetch the tables on-demand from a server, so they have to be downloaded with the code (they can go into different runAsync fragments though). I hope to work on that and other i18n issues next quarter, but I am not sure how much time I will have to work on it. http://gwt-code-reviews.appspot.com/226801 -- http://groups.google.com/group/Google-Web-Toolkit-Contributors