On Tue, Oct 21, 2014 at 11:38 AM, Jonathan S. Shapiro <[email protected]> wrote:
> On Wed, Oct 15, 2014 at 10:00 PM, Ben Kloosterman <[email protected]> wrote:
>
>> 2. Does the following set of rules for strings make sense? If no, why not?
>>
>> - Strings are normalized via NFC
>> - String operations preserve NFC encoding
>>
>> BK> Not sure you can treat this as a given. Say web sites serve UTF-8 and
>> XML messages -- do you really want to parse and rearrange all that data to
>> ensure NFC compliance?
>
> I claim that if you have to re-normalize it, then the incoming XML data
> isn't text.

Or you are not sure, e.g. inside a library.

> That doesn't stop you from dealing with it as byte data via a byte vector,
> and I can definitely see a case for implementing many string operations
> over either byte vector (or a wrapper on byte vector).
>
> Here's the thing: Unicode makes quite a mess of things by permitting valid
> text to be un-normalized in the first place. I'm not at all sure why they
> did that; it seems to me that they could easily have put well-formedness
> rules in place. Though perhaps once you had both NFC and NFD that ship had
> sailed.

Agreed; searching should not have this complication.

> What really has me in a twist here is that you want things like string
> comparison and search to work sensibly without huge complications, and on
> longer strings you would really rather not be forced to copy them in order
> to normalize them for comparison and search.
>
> Has anybody looked into algorithms for search and compare that normalize
> on the fly? If the penalty isn't too great, maybe that's the right way to
> resolve this.

I assume this will not work behind the scenes with strings being immutable, or it will be expensive, creating other data types. I think it is best to assume strings are compliant but not enforce it (from the above I think we are on the same page; earlier the language was a bit strong). Users should convert on import if they are not sure, and the default encoders should do this.
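One way to get part of the way to "normalize on the fly" without always paying for a copy is to use the Unicode normalization quick check as a fast path. A minimal sketch in Python (assuming Python 3.8+ for `unicodedata.is_normalized`; the function name `nfc_equal` is illustrative, not anything from BitC):

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Canonical (NFC) equality that avoids copying in the common case."""
    # Fast path: if both strings pass the NFC quick check, plain
    # code-point equality is also canonical equality -- no allocation.
    if unicodedata.is_normalized("NFC", a) and unicodedata.is_normalized("NFC", b):
        return a == b
    # Slow path: normalize (which copies) only when the check fails,
    # e.g. when one side arrived in decomposed (NFD-like) form.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

If most data in the wild is already NFC (which is the usual case for UTF-8 web and XML content), the slow path is rarely taken, which is roughly the "penalty isn't too great" trade-off being asked about.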
That is, when importing a UTF-8 web page or XML document, you can directly wrap the byte data to make a string if you are sure of your data (otherwise the user should convert); but if you run an encoder, it does normalize -- which includes a UTF-8 to UTF-8 NFC encoder.

>> In terms of performance / compatibility of old algorithms / benchmarks,
>> ASCII is still important. For this reason it is important to know that the
>> UTF-8 is ASCII.
>
> That's a very unfortunate thing to have as important, since it isn't
> correct. Perhaps you mean ASCII is valid UTF-8?

Obviously ASCII is valid UTF-8, and the reverse is not. To be clear: you want to know, when a UTF-8 string is created, whether it consists entirely of ASCII characters, or else you have to find this out later. Knowing this allows much more efficient conversion to a char array, and more efficient encoding to other formats (e.g. a Windows syscall can run a noticeably more efficient algorithm if it knows the data is ASCII). This should be an optional constructor overload, and set automatically for constant data. An optimization, to be sure, but an important one.

The big question is: will strings or sub-strings be / support slices?

Regards,

Ben
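The "ASCII-known" constructor idea can be sketched as follows. This is purely illustrative (the class name, constructor parameter, and `to_utf16` method are hypothetical, not BitC API); it just shows the flag being either supplied by a trusting caller or computed once at construction, then consulted on the hot path:

```python
class Utf8String:
    """Sketch of a UTF-8 string that records ASCII-ness at construction."""

    def __init__(self, data, known_ascii=None):
        self.data = data
        # If the caller already knows (e.g. constant data), trust it;
        # otherwise scan once. bytes.isascii() is a tight C-level loop.
        self.is_ascii = data.isascii() if known_ascii is None else known_ascii

    def to_utf16(self):
        if self.is_ascii:
            # Fast path: each ASCII byte widens to one UTF-16 code unit,
            # no multi-byte sequence handling needed.
            return self.data.decode("ascii").encode("utf-16-le")
        # General path: full UTF-8 decode.
        return self.data.decode("utf-8").encode("utf-16-le")
```

In a real implementation the fast path would be a trivial byte-widening loop rather than a decode/encode pair, but the shape of the optimization is the same: one bit of metadata, checked before every conversion.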
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
