Re: [gwt-contrib] Re: Unicode support for Character.is* methods

2020-04-28 Thread 'Goktug Gokdogan' via GWT Contributors

I don't have access to that patch, but even if it was correct, I'm sure it
was very costly.

A recent attempt looked like the following:

  private static class RegExps {
/*
 * Formatted strings for the fallback regular expressions were generated with the following
 * Python 3 script.
 *
 * from unicodedata import bidirectional, category
 * from itertools import *
 *
 * condition = lambda x: category(chr(x)).startswith("L")  # adjust this
 * uchr = lambda x: "u%04X" % x if bidirectional(chr(x)) in ("R", "AN", "AL") else chr(x)
 *
 * ranges = []
 * codepoints = [x for x in range(0, 0x) if condition(x)]
 * for _, group in groupby(enumerate(codepoints), lambda t: t[0] - t[1]):
 *   l = list(item for (index, item) in group)
 *   ranges += [uchr(l[0]) + ("-" if len(l) > 2 else "") + (uchr(l[-1]) if len(l) > 1 else "")]
 *
 * formatted = "\"["
 * for r in ranges:
 *   if len(formatted) + len(r) > 99:
 *     print(formatted + "\"")
 *     formatted = "+ \""
 *   formatted += r
 * print(formatted + "]\";")
 */
private static final String LETTER_FALLBACK =
"[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙա-և\\u05D0-\\u05EA\\u05F0-\\u05F2"
   + 
"\\u0620-\\u064A\\u066E\\u066F\\u0671-\\u06D3\\u06D5\\u06E5\\u06E6\\u06EE\\u06EF"
   + 
"\\u06FA-\\u06FC\\u06FF\\u0710\\u0712-\\u072F\\u074D-\\u07A5\\u07B1\\u07CA-\\u07EA"
   + 
"\\u07F4\\u07F5\\u07FA\\u0800-\\u0815\\u081A\\u0824\\u0828\\u0840-\\u0858"
   + 
"\\u08A0-\\u08B4\\u08B6-\\u08BDऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএঐও-নপ-রলশ-হঽৎড়ঢ়য়-ৡৰৱਅ-ਊਏਐਓ-ਨਪ-ਰਲਲ਼ਵਸ਼ਸਹ"
   + 
"ਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલળવ-હઽૐૠૡૹଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହଽଡ଼ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹௐఅ-ఌఎ-ఐ"
   + 
"ఒ-నప-హఽౘ-ౚౠౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠೡೱೲഅ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะาำเ-ๆກຂຄງຈຊຍ"
   + 
"ດ-ທນ-ຟມ-ຣລວສຫອ-ະາຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍ"
   + 
"ነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡷᢀ-ᢄᢇ-ᢨᢪ"
   + 
"ᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᳩ-ᳬᳮ-ᳱᳵᳶᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώ"
   + 
"ᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯ"
   + 
"ⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々〆〱-〵〻〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄭㄱ-ㆎㆠ-ㆺㇰ-ㇿ㐀-䶵一-鿕ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪꘫꙀ-ꙮ"
   + 
"ꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞮꞰ-ꞷꟷ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵꪶ"
   + 
"ꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭥꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ﬀ-ﬆﬓ-ﬗ\\uFB1D\\uFB1F-\\uFB28"
   + 
"\\uFB2A-\\uFB36\\uFB38-\\uFB3C\\uFB3E\\uFB40\\uFB41\\uFB43\\uFB44\\uFB46-\\uFBB1"
   + 
"\\uFBD3-\\uFD3D\\uFD50-\\uFD8F\\uFD92-\\uFDC7\\uFDF0-\\uFDFB\\uFE70-\\uFE74"
   + "\\uFE76-\\uFEFCＡ-Ｚａ-ｚｦ-ﾾￂ-ￇￊ-ￏￒ-ￗￚ-ￜ]";
private static final String DIGIT_FALLBACK =
"[0-9\\u0660-\\u0669۰-۹\\u07C0-\\u07C9०-९০-৯੦-੯૦-૯୦-୯௦-௯౦-౯೦-೯൦-൯෦-෯๐-๙໐-໙༠-༩၀-၉႐-႙០-៩᠐-᠙"
   + "᥆-᥏᧐-᧙᪀-᪉᪐-᪙᭐-᭙᮰-᮹᱀-᱉᱐-᱙꘠-꘩꣐-꣙꤀-꤉꧐-꧙꧰-꧹꩐-꩙꯰-꯹０-９]";
private static final String LOWER_CASE_FALLBACK =
"[a-zµß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĳĵķĸĺļľŀłńņňŉŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž-ƀƃƅƈƌƍƒƕƙ-ƛƞơƣ"
   + 
"ƥƨƪƫƭưƴƶƹƺƽ-ƿǆǉǌǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯǰǳǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿɀɂɇɉɋɍɏ-ʓʕ-ʯͱ"
   + 
"ͳͷͻ-ͽΐά-ώϐϑϕ-ϗϙϛϝϟϡϣϥϧϩϫϭϯ-ϳϵϸϻϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊ"
   + 
"ӌӎӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣԥԧԩԫԭԯա-ևᏸ-ᏽᲀ-ᲈᴀ-ᴫᵫ-ᵷᵹ-ᶚḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝ"
   + 
"ḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ-ẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉ"
   + 
"ịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰ-ώᾀ-ᾇᾐ-ᾗᾠ-ᾧᾰ-ᾴᾶᾷιῂ-ῄῆῇῐ-ΐῖῗῠ-ῧῲ-ῴῶῷℊ"
   + 
"ℎℏℓℯℴℹℼℽⅆ-ⅉⅎↄⰰ-ⱞⱡⱥⱦⱨⱪⱬⱱⱳⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⳤⳬⳮⳳ"
   + 
"ⴀ-ⴥⴧⴭꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙡꙣꙥꙧꙩꙫꙭꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꚙꚛꜣꜥꜧꜩꜫꜭꜯ-ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯ"
   + "ꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞌꞎꞑꞓ-ꞕꞗꞙꞛꞝꞟꞡꞣꞥꞧꞩꞵꞷꟺꬰ-ꭚꭠ-ꭥꭰ-ꮿﬀ-ﬆﬓ-ﬗａ-ｚ]";
private static final String UPPER_CASE_FALLBACK =
"[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİĲĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƉ-ƋƎ-ƑƓƔƖ-Ƙ"
   + 
"ƜƝƟƠƢƤƦƧƩƬƮƯƱ-ƳƵƷƸƼǄǇǊǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮǱǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺȻȽȾɁɃ-ɆɈɊɌ"
   + 
"ɎͰͲͶͿΆΈ-ΊΌΎΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼ"
   + 
"ҾӀӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱ-ՖႠ-ჅჇჍᎠ-ᏵḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞ"
   + 
"ḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎ"
   + 
"ỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ΆῈ-ΉῘ-ΊῨ-ῬῸ-Ώℂℇℋ-ℍℐ-ℒℕℙ-ℝℤΩℨK-ℭℰ-ℳℾℿ"
   + 
"ⅅↃⰀ-ⰮⱠⱢ-ⱤⱧⱩⱫⱭ-ⱰⱲⱵⱾ-ⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖ"
   + 
"ꙘꙚꙜꙞꙠꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꚘꚚꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽꝾꞀꞂꞄꞆꞋꞍꞐꞒꞖꞘꞚꞜꞞꞠꞢꞤꞦ"
   + "ꞨꞪ-ꞮꞰ-ꞴꞶＡ-Ｚ]";
private static final String TITLE_CASE_FALLBACK = "[ǅǈǋǲᾈ-ᾏᾘ-ᾟᾨ-ᾯᾼῌῼ]";
// The expressions below were generated by looping over all valid Unicode code points using the
// desktop version of Java.
private static final NativeRegExp WHITESPACE =
    new NativeRegExp(
"[\\u0009-\\u000D\\u001C-\\u002
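
The fallback strings above are regular-expression character classes. As a rough sketch of how such a class can back an is*() check, here is a Python version using a heavily shortened sample class. The names LETTER_FALLBACK_SAMPLE and is_letter are hypothetical and not from the patch, and the sample covers only a few of the ranges listed above.

```python
import re

# Hypothetical, heavily shortened character class. The real LETTER_FALLBACK
# string in the message covers far more ranges than this sample.
LETTER_FALLBACK_SAMPLE = "[A-Za-z\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1]"
LETTER_RE = re.compile(LETTER_FALLBACK_SAMPLE)


def is_letter(ch):
    """Return True if the single character ch matches the sample class."""
    return len(ch) == 1 and LETTER_RE.fullmatch(ch) is not None


for ch in ["a", "\u00E9", "\u00D7", "1"]:
    print(f"U+{ord(ch):04X}", is_letter(ch))
```

The multiplication sign U+00D7 falls in the gap between the À-Ö and Ø-ö ranges, which is why the class is written as two ranges rather than one.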

Re: [gwt-contrib] Re: Unicode support for Character.is* methods

2020-04-28 Thread Mark Proctor

This was never merged in the end. Are the patch sets still available?

On Thursday, 1 April 2010 16:11:49 UTC+1, John Tamplin wrote:
>
> On Wed, Mar 31, 2010 at 5:43 PM, Pascal Muetschard wrote:
>
>> I have uploaded another patch set to 
>> http://gwt-code-reviews.appspot.com/226801
>> to address the concerns raised. See inline messages below.
>>
>
> Thanks for your efforts -- it will be next week before I can look closely 
> at it.
>  
> -- 
> John A. Tamplin
> Software Engineer (GWT), Google
>

-- 
You received this message because you are subscribed to the Google Groups "GWT 
Contributors" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to google-web-toolkit-contributors+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/google-web-toolkit-contributors/01deac40-1eff-41c0-bf1a-9bd64e90ab7e%40googlegroups.com.

Re: [gwt-contrib] Re: Unicode support for Character.is* methods

2010-04-01 Thread John Tamplin

On Wed, Mar 31, 2010 at 5:43 PM, Pascal Muetschard <
pmuetschard...@google.com > wrote:

> I have uploaded another patch set to
> http://gwt-code-reviews.appspot.com/226801
> to address the concerns raised. See inline messages below.
>

Thanks for your efforts -- it will be next week before I can look closely at
it.

-- 
John A. Tamplin
Software Engineer (GWT), Google

-- 
http://groups.google.com/group/Google-Web-Toolkit-Contributors

To unsubscribe, reply using "remove me" as the subject.

[gwt-contrib] Re: Unicode support for Character.is* methods

2010-03-31 Thread Pascal Muetschard

I have uploaded another patch set to http://gwt-code-reviews.appspot.com/226801
to address the concerns raised. See inline messages below.

This latest version has an ASCII-only option for the is*() methods,
which has an overhead of a couple hundred bytes. See below for the
size penalties for the tables:

is*() methods: 5396 (all of them, not each)
getDirectionality(): 2627
getType(): 4112
getNumericValue: 2163
digit(char,int): 6779 (uses isDigit() and getNumericValue())

TOTAL: 11681 (a savings of 2617 - the shared code between each is
about 700 bytes)
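
One reading of these figures, which is an assumption and is not spelled out in the message, is that the total is the sum of the four table sizes minus the stated savings. digit(char,int) is left out of the sum because it reuses isDigit() and getNumericValue(). The arithmetic under that reading:

```python
# Sizes in bytes, as reported in the message.
table_sizes = {
    "is*() methods": 5396,
    "getDirectionality()": 2627,
    "getType()": 4112,
    "getNumericValue()": 2163,
}
shared_savings = 2617  # bytes saved by sharing code, as reported

# Assumption: digit(char,int) is not added, since it reuses two tables above.
standalone_sum = sum(table_sizes.values())
total = standalone_sum - shared_savings

print(standalone_sum)  # 14298
print(total)           # 11681
```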

I feel the size concern is addressed: when using all the tables, the
penalty is a mere 11k, which is quite small compared with the bootstrap
code (usually 5k) and HashMap (15k).

On Mar 16, 7:51 pm, j...@google.com wrote:
> A few issues:
>
> - the way this is divided, all of the code will get pulled into every
> app that calls any of these methods.

I had thought about separating the tables out for each of the is*()
methods. However, the extra code each of the objects adds quickly
becomes much larger than the data of the tables. This means that we
would sacrifice the runtime size of the common case for the corner
case where only a single is*() method is used. Also, the inherent
relationships (i.e. if isDefined is false, all others are false as
well) and mutual exclusion (isUpperCase vs isLowerCase vs isDigit)
between the attributes cut out a lot of duplication by combining the
tables.
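
To illustrate the idea of combining mutually exclusive attributes, here is a minimal Python sketch that stores one small code per code point and run-length encodes it. The code values and function names are hypothetical and this is not the layout used in the patch. It also approximates the Java predicates with Unicode general categories only.

```python
from unicodedata import category

# Hypothetical combined codes: one small integer per code point instead of
# one boolean table per is*() method.
UNDEFINED, UPPER, LOWER, DIGIT, OTHER = range(5)


def combined_code(cp):
    """Map a code point to a single code via its general category."""
    cat = category(chr(cp))
    if cat == "Cn":
        return UNDEFINED  # unassigned: every is*() answer is false
    if cat == "Lu":
        return UPPER
    if cat == "Ll":
        return LOWER
    if cat == "Nd":
        return DIGIT
    return OTHER


def run_length_encode(codes):
    """Collapse consecutive equal codes into [code, length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


# The 128 ASCII code points collapse into a handful of runs.
runs = run_length_encode([combined_code(cp) for cp in range(0x80)])
print(runs)
```

Because a code point can be at most one of upper case, lower case or digit in this model, a single table answers all three questions, and the undefined case needs no table of its own.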

I have also added the deferred property and your "ASCII version."
However, I've made unicode the default, as I feel like that's what
people expect from GWT - to be i18n compatible by default.

> - this is incomplete and doesn't have the other properties, such as
> getDirection, toLower, etc.

I've added getDirectionality(), getType(), getNumericValue() and
digit(char,int). I have excluded the to*Case() methods on purpose, as
their definition is not i18n correct - there are upper case characters
that need more than one character in lower case and vice versa.
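
The point about to*Case() can be seen with a few well-known characters. The sketch below uses Python's built-in full case mappings as a stand-in, not GWT or Java code, and prints how many code points each mapping produces.

```python
# U+00DF (sharp s), U+0130 (capital I with dot above), U+FB01 (fi ligature)
samples = ["\u00DF", "\u0130", "\uFB01"]

for ch in samples:
    upper, lower = ch.upper(), ch.lower()
    print(f"U+{ord(ch):04X}: upper -> {len(upper)} code point(s), "
          f"lower -> {len(lower)} code point(s)")
```

A char-to-char method cannot return "SS" for the sharp s or "i" plus a combining dot for U+0130, so a one-character API has to give a different answer than full string case conversion.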

>
> I had written a full implementation a while ago (it is still available
> in svn at changes/jat/ucd), which encoded each table separately with a
> combination of run-length encoding and huffman coding the runs, which
> got the size of individual tables down to a few hundred bytes each, and
> you only paid for the tables that were used.  The decompression code was
> of course larger, so maybe there is room for a simpler encoding
> mechanism that takes less code even if the data is larger.

I have looked at this and mine is similar. My version also encodes run
lengths and makes sure that the most common "tokens" have the smallest
representation. It also uses LZW to compress the data. This
compression is simple and needs a lot less code to decompress, but
still provides a good compression ratio.
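
For readers unfamiliar with LZW, here is a minimal byte-oriented sketch in Python. It is a textbook variant with an unbounded dictionary and is not the encoder from the patch. The sample input is a made-up sequence of run lengths.

```python
def lzw_compress(data):
    """Compress bytes into a list of LZW codes (dictionary seeded with 0..255)."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Invert lzw_compress by rebuilding the same dictionary on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code not yet in the table: it must be previous + its first byte.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out += entry
        table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)


# Made-up run lengths with a repeating pattern, as a stand-in for table data.
run_lengths = bytes([48, 10, 7, 26, 6, 26, 5] * 20)
codes = lzw_compress(run_lengths)
print(len(run_lengths), len(codes))
```

The decompressor needs only a dictionary and a loop, which matches the point that the decoding side stays small.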

>
> That effort was complete but was never merged in because some people
> objected to the code size increase.  Given the synchronous nature of the
> API, it isn't feasible to fetch the tables on-demand from a server, so
> they have to be downloaded with the code (they can go into different
> runAsync fragments though).
>
> I hope to work on that and other i18n issues next quarter, but I am not
> sure how much time I will have to work on it.

Which is why I'm trying to help with this :)

>
> http://gwt-code-reviews.appspot.com/226801


[gwt-contrib] Re: Unicode support for Character.is* methods

2010-03-16 Thread jat


A few issues:

- the way this is divided, all of the code will get pulled into every
app that calls any of these methods.
- this is incomplete and doesn't have the other properties, such as
getDirection, toLower, etc.

I had written a full implementation a while ago (it is still available
in svn at changes/jat/ucd), which encoded each table separately with a
combination of run-length encoding and huffman coding the runs, which
got the size of individual tables down to a few hundred bytes each, and
you only paid for the tables that were used.  The decompression code was
of course larger, so maybe there is room for a simpler encoding
mechanism that takes less code even if the data is larger.
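
As a rough sketch of the run-length part of such an encoding, here is a Python version that stores one boolean property as alternating run lengths and answers lookups by binary search. The Huffman coding step is omitted, and the function names are hypothetical and not taken from the svn branch.

```python
from bisect import bisect_right
from itertools import accumulate
from unicodedata import category


def encode_runs(flags):
    """Return run lengths of alternating values, starting with a run of False."""
    runs, current, length = [], False, 0
    for flag in flags:
        if flag == current:
            length += 1
        else:
            runs.append(length)
            current, length = flag, 1
    runs.append(length)
    return runs


def lookup(boundaries, cp):
    """True if cp lies in an odd-numbered (True) run."""
    return bisect_right(boundaries, cp) % 2 == 1


# Decimal digits in the first 256 code points, as a small example table.
is_digit = [category(chr(cp)) == "Nd" for cp in range(0x100)]
runs = encode_runs(is_digit)
boundaries = list(accumulate(runs))

print(runs)
print(lookup(boundaries, ord("5")), lookup(boundaries, ord("a")))
```

Each property gets its own short list of run lengths, so an application that only calls one method would only need that one list.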

That effort was complete but was never merged in because some people
objected to the code size increase.  Given the synchronous nature of the
API, it isn't feasible to fetch the tables on-demand from a server, so
they have to be downloaded with the code (they can go into different
runAsync fragments though).

I hope to work on that and other i18n issues next quarter, but I am not
sure how much time I will have to work on it.


http://gwt-code-reviews.appspot.com/226801

