http://d.puremagic.com/issues/show_bug.cgi?id=5543
--- Comment #11 from Dmitry Olshansky <dmitry.o...@gmail.com> 2012-12-21 08:00:56 PST ---
(In reply to comment #10)
> (In reply to comment #5)
> >
> > I'm wrapping up a revamp of std.uni that makes it a piece of cake to create
> > character sets. And maps are converted to multi-staged tables that are
> > faster than binary search on a large set. I'd suggest waiting a bit on it
> > (so as not to duplicate work) and introducing only the std.ascii version as
> > the most useful.
> >
> > The ongoing polishing, fixing and testing against ICU is going on here:
> > https://github.com/blackwhale/gsoc-bench-2012
>
> OK: The thing I was having trouble with, though, is that the existing binary
> search returns a bool, whereas I need the actual entry, so I can do "value -
> entry[0]", eg:
>
> //----
> static immutable dchar[2][] table1 = [
>     [ 0x0030, 0x0039], //
>     [ 0x0660, 0x0669], //ARABIC-INDIC
>     [ 0x06F0, 0x06F9], //EXTENDED ARABIC-INDIC
>
> ...
> //---
> That's because all the entries in [Nd] are consecutive numerals starting at 0.
> I can also cram a select couple of entries from [Nl] and [Po] that also use
> this scheme.
>

Sometimes I was able to abuse the natural format of the data and sometimes I
failed. But what proved to be quite good is varying the sizes of the
multi-staged table to match the "periods" of the data. In the end, if the data
has a lot of common "rows", a multi-staged table of a certain size per stage is
bound to hit a sweet spot.

> So if I have the unicode 0x0665 (The ARABIC-INDIC numeral '5'), I'd want to
> find [ 0x0660, 0x0669], and then "return 0x0665 - 0x0660".
>
> Well, I don't need the entire pair, but at least the lhs of the pair.
>
> If you could keep that in mind during your re-write. Or not. Just throwing it
> out there.
>
> For all other entries in [Nl] and [Po], I'd have:
> static immutable dchar[2][] table1 = [
>     [ 0x216D, 100], //ROMAN NUMERAL ONE HUNDRED
>
> So that's just a basic dictionary. But I don't think you can statically
> allocate an AA.
> So yeah, just throwing that your direction too.
>

Well, AA is a fat pig w.r.t. RAM usage. But thanks anyway.

> > > The file is too large for std.xml to handle, so it's back to C++ for me :/
> > >
> > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> >
> > Same thing but no useless XML trash. Description of fields is somewhere in
> > the middle of this document
> > http://www.unicode.org/reports/tr44/
>
> Nice, TY.
>
> > > The only questions I have are:
> > > Return value: int or double?
> >
> > Should be rational to accurately represent things like "1/5" character ;)
> > I do suspect some simple custom type could do (2 shorts packed in one struct
> > etc.).
> >
> > > Input is not numeric: -1 or exception?
> >
> > -1 is fine I think as this is rather low level (per character) and it's not
> > at all convenient to throw (and then catch).
>
> The only issue I have with returning -1 is that it is a magic value. The fact
> that there is no unicode for -1 is pure coincidence, and not by design. In
> particular, any attempt to write "if (numericValue(c) < 0) fail" would also be
> wrong because:
> http://unicode.org/cldr/utility/character.jsp?a=0F33
> The TIBETAN DIGIT HALF ZERO returns -0.5
>
> Do we *really* want to standardize the syntax of "if (numericValue(c) < -0.7)"
> ?
>
> ...
>
> Damn you unicode!

Aye, and given there are things like "1e12" I don't think packing it would work
any better... some kind of custom type is required.
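
For illustration only, here is a minimal D sketch of the lookup asked for in the quoted comment: a binary search over [first, last] pairs that finds the matching entry, so the digit value can be computed as "c - entry[0]". The names digitRanges and digitValue are hypothetical, the table is just the three-entry excerpt quoted above, and this is plain binary search rather than the multi-staged tables of the std.uni revamp. Returning -1 for non-digits is an assumption that only works here because every value in these ranges is 0 to 9; as noted above, it would not be safe for a general numericValue.

```d
// Hypothetical excerpt of the [Nd] table from the discussion: each entry is
// [first, last] of a run of ten consecutive code points with values 0..9.
// Entries must be sorted by code point for the binary search to work.
static immutable dchar[2][] digitRanges = [
    [0x0030, 0x0039], // ASCII digits
    [0x0660, 0x0669], // ARABIC-INDIC
    [0x06F0, 0x06F9], // EXTENDED ARABIC-INDIC
];

// Hypothetical helper: returns the digit value of c (0..9), or -1 if c is
// not inside any range of the table.
int digitValue(dchar c) @safe pure nothrow @nogc
{
    size_t lo = 0;
    size_t hi = digitRanges.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (c < digitRanges[mid][0])
            hi = mid;            // c lies before this range
        else if (c > digitRanges[mid][1])
            lo = mid + 1;        // c lies after this range
        else
            return cast(int)(c - digitRanges[mid][0]); // value - entry[0]
    }
    return -1;
}

void main()
{
    assert(digitValue('7') == 7);
    assert(digitValue('\u0665') == 5);  // 0x0665 - 0x0660
    assert(digitValue('\u06F9') == 9);
    assert(digitValue('a') == -1);
}
```

The search compares against both ends of each pair, so the caller gets the offset from the left-hand side of the pair directly instead of a bool.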