Hi Cameron,

If the goal is to include this handling for UTF-16 and UTF-32, I suggest
proposing it to Erlang/OTP as new functions in the "unicode" module, since
Elixir itself only has facilities to deal with UTF-8. You could propose
such a feature in the Erlang/OTP issue tracker.
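
For context, here is a minimal sketch of what the existing "unicode" module
offers today: `:unicode.characters_to_binary/3` reports where conversion
stopped, but it does not substitute anything for the invalid input. The
wrapper module and function names below are hypothetical, chosen only for
illustration.

```elixir
defmodule UnicodeProbe do
  # Hypothetical helper, not part of Elixir or OTP.
  # Converts `binary` from `encoding` (e.g. :utf8, {:utf16, :big}) to UTF-8
  # and tags the three result shapes that :unicode.characters_to_binary/3
  # can return.
  def to_utf8(binary, encoding) when is_binary(binary) do
    case :unicode.characters_to_binary(binary, encoding, :utf8) do
      utf8 when is_binary(utf8) ->
        {:ok, utf8}

      # Invalid input: `converted` is the valid prefix, `rest` starts at the
      # offending data. Nothing is replaced.
      {:error, converted, rest} ->
        {:error, converted, rest}

      # Input ended in the middle of an encoded character.
      {:incomplete, converted, rest} ->
        {:incomplete, converted, rest}
    end
  end
end
```

A replacement function would have to loop over the `{:error, converted, rest}`
results itself, which is the part that is missing today.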

Also note that "rolling your own" or "depending on packages" is usually not
reason enough for adding features to Elixir. Otherwise, one could easily
argue Decimal and Jason would be more important additions to the language.
:) We do describe which features we would consider part of the language
here: https://elixir-lang.org/development.html

Other than that, awesome job on the library and benchmarks. :)

On Sat, Oct 7, 2023 at 1:03 AM Kip <kipco...@gmail.com> wrote:

> Your implementation is definitely fast and memory efficient so I retract
> my implementation comments. Now that I've run the benchmarking script and
> tested out a few different approaches leveraging the std lib I understand
> better why you've taken the approach you have. Nice work.
>
> On Saturday, October 7, 2023 at 9:26:37 AM UTC+11 Kip wrote:
>
>> Cameron, I think this is a useful proposal.  Elixir has means to check
>> validity (String.valid?/1) and a mechanism to split valid and invalid code
>> points (String.chunk/2 with the :valid trait). But there isn't, to my
>> knowledge, a means to coerce validity.  A couple of thoughts:
>>
>> 1. Since Elixir strings are, by definition, UTF-8, I don't know that
>> special handling of UTF-16 and UTF-32 code points makes much sense, although
>> I accept this may be more Unicode compliant.
>> 2. What would the function be called? Since we have String.valid?/1 maybe
>> String.validate/2 with an option `replace_invalid: utf8_string`. The
>> default `:replace_invalid` could be U+FFFD or it could be `nil`.   If
>> the default is `nil` then there could also be a `String.validate!/2` that
>> raises if there is no `:replace_invalid` option.
>> 3. I think the implementation could leverage the code of `String.chunk/2`
>> which uses `String.next_codepoint/1`. That would simplify implementation
>> and be more consistent in code style.
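
As a rough sketch of point 3, something like the following could build on
`String.chunk/2` with the `:valid` trait. The module name, function name, and
default replacement are hypothetical and not part of Elixir's `String` API.
Note that it collapses each whole run of invalid bytes into a single
replacement, which is coarser than the per-maximal-subpart substitution the
Unicode standard suggests, so treat it as an illustration rather than a
compliant implementation.

```elixir
defmodule ValidateSketch do
  # Hypothetical module; illustrates the String.chunk/2 approach only.
  @replacement "\uFFFD"

  # Returns `binary` with every run of invalid bytes replaced by
  # `replacement` (U+FFFD by default).
  def replace_invalid(binary, replacement \\ @replacement) when is_binary(binary) do
    binary
    # Splits into alternating runs of valid and invalid data.
    |> String.chunk(:valid)
    |> Enum.map(fn chunk ->
      if String.valid?(chunk), do: chunk, else: replacement
    end)
    |> IO.iodata_to_binary()
  end
end
```

For example, `ValidateSketch.replace_invalid(<<"abc", 0xFF, "def">>)` would
yield `"abc\uFFFDdef"`, and already-valid strings pass through unchanged.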
>>
>> On Friday, October 6, 2023 at 12:24:28 PM UTC+11 cameron...@gmail.com
>> wrote:
>>
>>> As far as I can tell, neither Elixir nor Erlang has a built-in function
>>> for replacing invalid sequences in Unicode. There's a suggested method on
>>> this page
>>> <https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153>
>>> of the Unicode standard for handling this. Several other languages (Go
>>> <https://pkg.go.dev/bytes#ToValidUTF8>, Python
>>> <https://docs.python.org/3/library/stdtypes.html#bytes.decode>, C#
>>> <https://github.com/dotnet/docs/issues/13547>, etc) now follow this
>>> spec.
>>>
>>> Invalid Unicode is encountered frequently enough that I think it's worth
>>> incorporating a solution into Elixir itself.
>>>
>>> Present alternatives to handling invalid Unicode (and JSON by extension
>>> <https://github.com/michalmuskala/jason/issues/174>) are:
>>>
>>>    - Crashing (not ideal in many cases)
>>>    - Roll your own (lot of overhead for accidental complexity)
>>>    - Depend on a package (+1 package towards dependency hell)
>>>
>>> This is my college try
>>> <https://github.com/Moosieus/UniRecover/tree/main>, but I'm certain
>>> there's a performant and far cleaner solution to be had in pure Elixir. If
>>> not, perhaps this is a request for OTP.
>>>
>> --
> You received this message because you are subscribed to the Google Groups
> "elixir-lang-core" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elixir-lang-core+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com
> <https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
