I have done some exploratory work on this sort of thing (implementing Python's UnicodeData), and a _correct_ implementation is both large and difficult.
A full Unicode properties file I created was 615k, which I think is about as large as the entire Elixir standard library. And that is without the East Asian character set. I think it would make a good 3rd party lib though. I never finished the work, but I can throw it up on GitHub if someone else wants to finish it.

On Wednesday, May 4, 2016 at 9:05:21 AM UTC-7, Peter Marreck wrote:
>
> As an example of what would need to be done by necessity for proper
> compliance with the Unicode spec, check out the "Derived Property: Alphabetic"
> codepoint list section of this doc:
>
> ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
>
> "Total code points: 110943"
>
> And that's just for the "is_alphabetic?" function! (Sure, this would be
> macroed out, but as Eric said, it would definitely increase the binary size
> further...)
>
> I still think this is useful functionality (and would likely be many
> orders of magnitude faster than relying on Regex to determine these things
> due to Elixir/Erlang's fast function-head pattern matching)
>
> --
> Peter Marreck
>
> On Tuesday, May 3, 2016 at 6:24:33 PM UTC-4, Eric Meadows-Jönsson wrote:
>>
>> The problem is that the Unicode module is already big; the file size of
>> the .beam file is one of the largest in Elixir. There are also issues
>> compiling this file on systems with 512 MB of memory. idna, an Erlang library
>> for Unicode, has similar issues on systems with low memory. Adding more
>> functions that will need a large number of function clauses will make the
>> issue worse and the size of the compiled Elixir we distribute larger.
>>
>> I think it's better to have this functionality in a library until we can
>> solve the memory issue and only have the bare necessities for Unicode
>> support in stdlib. If we later can move it into stdlib it would be good to
>> have the API figured out and bugs fixed in another library that can iterate
>> faster.
>>
>> On Tue, May 3, 2016 at 11:29 PM, eksperimental <[email protected]>
>> wrote:
>>
>>> I'm not too sure that all of those functions should be added.
>>> There could be too many of them, and they would not be easy to extend.
>>> But how about a Unicode.info/1 function that returns a tuple with
>>> information about that character, such as:
>>> iex> Unicode.info("A")
>>> {:alphanumeric, :uppercase, :ascii}
>>>
>>> It will be easy to improve as we find more information that can be added,
>>> such as ISO types and other groups (especially for encodings we are not
>>> familiar with).
>>>
>>> Additionally we could have check?/2 (or some better name probably!)
>>> iex> Unicode.check?("A", :uppercase)
>>> true
>>> iex> Unicode.check?("A", :numeric)
>>> false
>>>
>>> On Tue, 3 May 2016 12:31:44 -0700 (PDT)
>>> [email protected] wrote:
>>>
>>> > I have seen multiple people (in the Elixir Slack group
>>> > <https://elixir-lang.slack.com/archives/general/p1462294660007855>,
>>> > on Reddit
>>> > <https://www.reddit.com/r/elixir/comments/4h4y4e/whats_missing_from_the_elixir_ecosystem/d2nvbwd>)
>>> > during the last couple of days requiring something that checks if a
>>> > (possibly long) string contains e.g. only alphanumeric characters.
>>> >
>>> > It is possible to do this using regular expressions right now:
>>> > ~r/[^[:alnum:]]/u
>>> >
>>> > but this is very slow.
>>> >
>>> > My proposal is to add the following boolean functions to the String
>>> > module:
>>> >
>>> > - alphabetic?
>>> > - numeric?
>>> > - alphanumeric?
>>> > - whitespace?
>>> > - uppercase?
>>> > - lowercase?
>>> > - control_character?
>>> >
>>> > Function heads for these functions can probably be best generated by
>>> > using compile-time macros similar to what other Unicode-based
>>> > functions already use.
>>>
>>
>> --
>> Eric Meadows-Jönsson
>

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/3f1114cd-3110-4056-8770-bc4690930b9d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
