On Tue, Oct 21, 2014 at 11:38 AM, Jonathan S. Shapiro <[email protected]>
wrote:

> On Wed, Oct 15, 2014 at 10:00 PM, Ben Kloosterman <[email protected]>
> wrote:
>
>> 2. Does following set of rules for strings make sense? If no, why not?
>>
>>    - Strings are normalized via NFC
>>    - String operations preserve NFC encoding
>>
>> BK> Not sure you can treat this as a given .. say with UTF-8 web pages
>> and XML messages, do you really want to parse and rearrange all that data
>> to ensure NFC compliance?
>>
>
> I claim that if you have to re-normalize it, then the incoming XML data
> isn't text.
>

Or you are not sure, e.g., in a library.
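
For that case, something like this (a minimal Python sketch of what a
defensive library boundary might do; unicodedata.is_normalized requires
Python 3.8+):

    import unicodedata

    def accept_text(data):
        # Decode incoming bytes; malformed UTF-8 raises UnicodeDecodeError.
        s = data.decode("utf-8")
        # Only pay for a normalization copy when the input is not already NFC.
        if not unicodedata.is_normalized("NFC", s):
            s = unicodedata.normalize("NFC", s)
        return s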

>
> That doesn't stop you from dealing with it as byte data via a byte vector,
> and I can definitely see a case for implementing many string operations
> over either byte vector (or a wrapper on byte vector).
>
> Here's the thing: Unicode makes quite a mess of things by permitting valid
> text to be un-normalized in the first place. I'm not at all sure why they
> did that; it seems to me that they could easily have put well-formedness
> rules in place. Though perhaps once you had both NFC and NFD that ship had
> sailed.
>

Agreed; search should not have this complication.

>
> What really has me in a twist here is that you want things like string
> comparison and search to work sensibly without huge complications, and on
> longer strings you would really rather not be forced to copy them in order
> to normalize them for comparison and search.
>
> Has anybody looked into algorithms for search and compare that normalize
> on the fly? If the penalty isn't too great, maybe that's the right way to
> resolve this.
>

I assume normalizing on the fly will not work behind the scenes with strings
being immutable, or will be expensive because it has to create other data
types. I think it's best to assume strings are compliant but not enforce it
(from the above I think we are on the same page; earlier the language was a
bit strong). Users should convert on import if they are not sure, and the
default encoders should do this.
I.e., when importing a UTF-8 web page or XML document you can directly wrap
the byte data to make a string if you are sure of your data, and otherwise
the user should convert; if you run an encoder it normalizes as it decodes,
which implies having a UTF-8 to UTF-8/NFC encoder.
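
In code, roughly (a Python sketch of the two import paths; the function
names are made up for illustration):

    import unicodedata

    def import_trusted(data):
        # Caller asserts the bytes are already NFC UTF-8: wrap, don't rearrange.
        return data.decode("utf-8")

    def import_normalizing(data):
        # Default encoder path: decode and normalize to NFC in one pass.
        return unicodedata.normalize("NFC", data.decode("utf-8"))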

>
>
>
>
>
>> In terms of performance / compatibility with old algorithms / benchmarks,
>> ASCII is still important. For this reason it's important to know that the
>> UTF-8 is ASCII.
>>
>
> That's a very unfortunate thing to have as important, since it isn't
> correct. Perhaps you mean ASCII is valid UTF-8?
>

Obviously ASCII is valid UTF-8 and the reverse is not. To be clear: you want
to know, when a UTF-8 string is created, whether it is all ASCII characters,
or else you need to find this out later. Knowing this allows much more
efficient conversion to a char array or encoding to other formats (e.g., a
Windows syscall can run a noticeably more efficient algorithm if it knows
the data is ASCII). This should be an optional constructor overload, and it
should be set for constant data. An optimization, to be sure, but an
important one.
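
Roughly (a hypothetical Python sketch; the Utf8String wrapper and its flag
are invented for illustration):

    class Utf8String:
        def __init__(self, data, is_ascii=None):
            self.data = data
            # Trust a caller-supplied flag (e.g., set for constant data);
            # otherwise scan the bytes once at construction.
            self.is_ascii = data.isascii() if is_ascii is None else is_ascii

        def to_utf16le(self):
            if self.is_ascii:
                # Fast path: pure-ASCII bytes widen one-to-one.
                return self.data.decode("ascii").encode("utf-16-le")
            return self.data.decode("utf-8").encode("utf-16-le")

With constant data the flag can be baked in at compile time; for everything
else one linear scan at construction amortizes across every later conversion.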


The big question is whether strings and sub-strings will be, or at least
support, slices.
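
If they do, something like this (a hypothetical Python sketch of a zero-copy
sub-string; note that slice boundaries would have to fall on code-point, and
under NFC ideally grapheme, boundaries):

    class StrSlice:
        def __init__(self, base, start, stop):
            # A view into the parent's bytes: no copy is made.
            self.view = memoryview(base)[start:stop]

        def __str__(self):
            # Decoding fails if the slice cuts a code point in half,
            # which is why boundary checks matter for string slices.
            return bytes(self.view).decode("utf-8")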

Regards,

Ben