Re: [elixir-core:5665] Add boolean methods for different unicode character groups (String.alphanumeric?, etc)

jisaacstone Wed, 04 May 2016 12:37:15 -0700

I have done some exploratory work on this sort of thing (implementing 
Python's unicodedata) and a _correct_ implementation is both large and 
difficult.


A full Unicode properties file I created was 615k, which I think is about 
as large as the entire Elixir standard library.

And that is without the East Asian character set.

I think it would make a good third-party lib though.

I never finished the work, but I can throw it up on GitHub if someone else 
wants to finish it.
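
For anyone picking this up, a minimal sketch of the function-head approach discussed in the quoted messages below might look like the following. The module name UnicodeSketch and the handful of hard-coded ranges are placeholders chosen for illustration only; a real library would presumably generate the full range list from DerivedCoreProperties.txt at compile time, which is exactly where the size and memory concerns come from.

```elixir
defmodule UnicodeSketch do
  # Hypothetical, hand-picked subset of "Alphabetic" ranges, for illustration.
  # A complete implementation would parse DerivedCoreProperties.txt instead.
  @alphabetic_ranges [
    {?A, ?Z},
    {?a, ?z},
    {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}
  ]

  # One private function clause is generated per range at compile time.
  for {first, last} <- @alphabetic_ranges do
    defp alphabetic_codepoint?(cp)
         when cp >= unquote(first) and cp <= unquote(last),
         do: true
  end

  # Any codepoint not covered by a generated clause is not alphabetic.
  defp alphabetic_codepoint?(_cp), do: false

  # Walks the binary one UTF-8 codepoint at a time.
  # The empty string returns true here; that is a design choice, not a rule.
  def alphabetic?(<<>>), do: true

  def alphabetic?(<<cp::utf8, rest::binary>>) do
    alphabetic_codepoint?(cp) and alphabetic?(rest)
  end

  # Binaries that are not valid UTF-8 fall through to this clause.
  def alphabetic?(other) when is_binary(other), do: false
end
```

With only these four ranges, UnicodeSketch.alphabetic?("Jönsson") matches and UnicodeSketch.alphabetic?("abc123") does not. Each extra range adds one more clause, so the full Alphabetic property would mean many hundreds of clauses in the compiled module.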

On Wednesday, May 4, 2016 at 9:05:21 AM UTC-7, Peter Marreck wrote:
>
> As an example of what would need to be done by necessity for proper 
> compliance with Unicode spec, check out the "Derived Property: Alphabetic" 
> codepoint list section of this doc:
>
> ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
>
> "Total code points: 110943"
>
> And that's just for the "is_alphabetic?" function! (Sure, this would be 
> macroed out, but as Eric said, it would definitely increase the binary size 
> further...)
>
> I still think this is useful functionality (and would likely be many 
> orders of magnitude faster than relying on Regex to determine these things 
> due to Elixir/Erlang's fast function-head pattern matching)
>
> --
> Peter Marreck
>
> On Tuesday, May 3, 2016 at 6:24:33 PM UTC-4, Eric Meadows-Jönsson wrote:
>>
>> The problem is that the Unicode module is already big; the file size of 
>> the .beam file is one of the largest in Elixir. There are also issues 
>> compiling this file on systems with 512 MB of memory. idna, an Erlang library 
>> for Unicode, has similar issues on systems with low memory. Adding more 
>> functions that will need a large number of function clauses will make the 
>> issue worse and the size of the compiled Elixir we distribute larger.
>>
>> I think it's better to have this functionality in a library until we can 
>> solve the memory issue and only have the bare necessities for unicode 
>> support in stdlib. If we later can move it into stdlib it would be good to 
>> have the API figured out and bugs fixed in another library that can iterate 
>> faster.
>>
>> On Tue, May 3, 2016 at 11:29 PM, eksperimental <[email protected]> 
>> wrote:
>>
>>> I'm not sure all of those functions should be added. There could be
>>> too many of them, and the set would not be easy to extend.
>>> But how about a Unicode.info/1 function that returns a tuple with
>>> information about the character, such as
>>> iex> Unicode.info("A")
>>> ...> {:alphanumeric, :uppercase, :ascii}
>>>
>>> It will be easy to improve as we find more information that can be added,
>>> such as ISO types and other groups (especially for encodings we are not
>>> familiar with).
>>>
>>> Additionally we could have check?/2 (or some better name probably!)
>>> iex> Unicode.check?("A", :uppercase)
>>> ...> true
>>> iex> Unicode.check?("A", :numeric)
>>> ...> false
>>>
>>>
>>> On Tue, 3 May 2016 12:31:44 -0700 (PDT)
>>> [email protected] wrote:
>>>
>>> > I have seen multiple people (In the Elixir Slack group
>>> > <https://elixir-lang.slack.com/archives/general/p1462294660007855>,
>>> > on Reddit
>>> > <https://www.reddit.com/r/elixir/comments/4h4y4e/whats_missing_from_the_elixir_ecosystem/d2nvbwd>)
>>> > during the last couple of days requiring something that checks if a
>>> > (possibly long) string contains e.g. only alphanumeric characters.
>>> >
>>> > It is possible to do this using regular expressions right now:
>>> > ~r/[^[:alnum:]]/u
>>> >
>>> > but this is very slow.
>>> >
>>> > My proposal is to add the following boolean functions to the String
>>> > module:
>>> >
>>> >
>>> >    -  alphabetic?
>>> >    -  numeric?
>>> >    -  alphanumeric?
>>> >    -  whitespace?
>>> >    -  uppercase?
>>> >    -  lowercase?
>>> >    -  control_character?
>>> >
>>> >
>>> > Function heads for these functions can probably be best generated by
>>> > using compile-time macros similar to what other unicode-based
>>> > functions already use.
>>> >
>>>
>>> --
>>> You received this message because you are subscribed to the Google 
>>> Groups "elixir-lang-core" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elixir-lang-core/20160504042910.57fd86e0.eksperimental%40autistici.org
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>> -- 
>> Eric Meadows-Jönsson
>>
>
