Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-14 Thread Armin Ronacher
Hi, On 02/05/2014 09:55, Malthe Borch wrote: It blows up – as expected, because ASCII is a limited encoding. It's not just ASCII. That said, blowing up at encoding time is terrible because you don't know where the error comes from. This is especially a huge problem on Python 3 right now

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-14 Thread Armin Ronacher
Hi, On 02/05/2014 00:03, John Downey wrote: I have actually always been a fan of how .NET did this. The System.String type is opinionated in how it is stored internally and does not allow anyone to change that (unlike Ruby). The conversion from String to byte[] is done using explicit conversion

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-14 Thread Matthieu Monrocq
On Wed, May 14, 2014 at 2:25 PM, Armin Ronacher armin.ronac...@active-4.com wrote: Hi, On 02/05/2014 00:03, John Downey wrote: I have actually always been a fan of how .NET did this. The System.String type is opinionated in how it is stored internally and does not allow anyone to change

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-02 Thread Ben Kloosterman
The encoding / glob. code in .NET works well, but the strings' use of code-points is a poor choice, and both C# and Java suffer heavily for it when doing IO. Ropes / chords / chains etc. belong at a higher level, not in the lowest-level type. Ben On Fri, May 2, 2014 at 8:03 AM, John Downey

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-02 Thread Malthe Borch
On 2 May 2014 00:06, Tony Arcieri basc...@gmail.com wrote: This sounds like the exact same painful failure mode as Ruby (transcoding blowing up at completely unexpected times) with even more complexity, making it even harder to debug. Here is a concrete example of when this would blow up: 1.

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-02 Thread Ben Kloosterman
On 5/2/14, Malthe Borch mbo...@gmail.com wrote: On 2 May 2014 00:06, Tony Arcieri basc...@gmail.com wrote: This sounds like the exact same painful failure mode as Ruby (transcoding blowing up at completely unexpected times) with even more complexity, making it even harder to debug. Here is

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-02 Thread Mikhail Zabaluev
Hi, 2014-05-02 3:52 GMT+03:00 Nathan Myers n...@cantrip.org: There's a string type because it *enforces* the guarantee of containing valid UTF-8, meaning it can always be converted to code points. This also means all of the Unicode algorithms can assume that they're dealing with a valid
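The quoted point about a string type that enforces valid UTF-8 can be illustrated with a minimal sketch. It uses the current Rust standard library (`std::str::from_utf8`), not the 2014 API under discussion in the thread, and the helper names `parse_utf8` and `count_code_points` are hypothetical, chosen only for this example.

```rust
use std::str;

/// Validates an untrusted byte slice once, at the boundary.
/// On success the caller gets a `&str`, which is guaranteed to be valid UTF-8.
pub fn parse_utf8(bytes: &[u8]) -> Result<&str, str::Utf8Error> {
    str::from_utf8(bytes)
}

/// Counts code points. No error path is needed here, because the
/// `&str` argument can only exist if validation has already succeeded.
pub fn count_code_points(text: &str) -> usize {
    text.chars().count()
}
```

With this split, `parse_utf8(&[0xff, 0xfe])` returns an error at the point where the bytes enter the program, while code that receives a `&str` never has to re-check it.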

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Mikhail Zabaluev
Hi, 2014-05-01 16:53 GMT+03:00 Malthe Borch mbo...@gmail.com: In Rust, the built-in std::str type is a sequence of unicode codepoints encoded as a stream of UTF-8 bytes. Meanwhile, building on experience with Python 2 and 3, I think it's worth considering a more flexible design. A string

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Nathan Myers
It would be a mistake for a byte sequence container, stream, or string type to know anything about particular encodings. An encoding is an interpretation imposed on a byte sequence. Users of a sequence need to be able to choose what interpretation to apply without interference from some

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Malthe Borch
On Thursday, May 1, 2014, Nathan Myers n...@cantrip.org wrote: It would be a mistake for a byte sequence container, stream, or string type to know anything about particular encodings. An encoding is an interpretation imposed on a byte sequence. Users of a sequence need to be able to choose

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Tony Arcieri
On Thu, May 1, 2014 at 6:53 AM, Malthe Borch mbo...@gmail.com wrote: A string would be essentially a rope where each leaf specifies an encoding, e.g. UTF-8 or ISO8859-1 (ideally expressed as one or two bytes). That is, a string may be comprised of segments of different encodings. Oh god
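To make the quoted proposal concrete, here is a rough sketch of a rope whose leaves each declare an encoding. Nothing here comes from the thread beyond the idea itself: the `Encoding` and `Rope` types, their method names, and the choice to model only UTF-8 and ISO 8859-1 are all assumptions made for illustration, and the code targets current Rust.

```rust
/// Encodings a leaf may declare; only two are modelled in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Encoding {
    Utf8,
    Latin1, // ISO 8859-1: each byte is the code point of the same value
}

/// Hypothetical rope: leaves hold raw bytes plus their declared encoding.
pub enum Rope {
    Leaf { encoding: Encoding, bytes: Vec<u8> },
    Concat(Box<Rope>, Box<Rope>),
}

impl Rope {
    pub fn leaf(encoding: Encoding, bytes: &[u8]) -> Rope {
        Rope::Leaf { encoding, bytes: bytes.to_vec() }
    }

    /// Joining two ropes never transcodes; it only builds a new node.
    pub fn concat(left: Rope, right: Rope) -> Rope {
        Rope::Concat(Box::new(left), Box::new(right))
    }

    /// Decodes every leaf, left to right, and appends the code points to `out`.
    /// Assumption: invalid UTF-8 in a leaf is replaced with U+FFFD here
    /// instead of being rejected when the leaf is created.
    fn push_code_points(&self, out: &mut String) {
        match self {
            Rope::Leaf { encoding: Encoding::Utf8, bytes } => {
                out.push_str(&String::from_utf8_lossy(bytes));
            }
            Rope::Leaf { encoding: Encoding::Latin1, bytes } => {
                out.extend(bytes.iter().map(|&b| b as char));
            }
            Rope::Concat(left, right) => {
                left.push_code_points(out);
                right.push_code_points(out);
            }
        }
    }

    /// Flattens the rope into a single UTF-8 string.
    pub fn to_utf8_string(&self) -> String {
        let mut out = String::new();
        self.push_code_points(&mut out);
        out
    }
}
```

Flattening to UTF-8 cannot fail in this sketch, because every Latin-1 byte maps to a Unicode code point. The reverse direction is where the objections raised in the replies apply.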

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Malthe Borch
On 1 May 2014 21:03, Tony Arcieri basc...@gmail.com wrote: Oh god no! Please no. This is what Ruby does and it's a complete nightmare. This creates an entire new class of bug when operations are performed on strings with incompatible encodings. It's an entire class of bug that simply doesn't

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Malthe Borch
On 1 May 2014 18:54, Mikhail Zabaluev mikhail.zabal...@gmail.com wrote: I don't think that so much hidden complexity would be justified in the built-in string type. Encoded text is typically dealt with in protocol libraries or similar I/O barriers where it should be passed through a validating

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Tony Arcieri
On Thu, May 1, 2014 at 1:06 PM, Malthe Borch mbo...@gmail.com wrote: This is not the case in the proposed design. You're wrong. All string operations would behave exactly as if there was only a single encoding. The only requirement is that the strings are properly declared with an

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Daniel Micay
On 01/05/14 09:53 AM, Malthe Borch wrote: In Rust, the built-in std::str type is a sequence of unicode codepoints encoded as a stream of UTF-8 bytes. Meanwhile, building on experience with Python 2 and 3, I think it's worth considering a more flexible design. A string would be essentially

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Steve Klabnik
Yes, this is what Ruby does, and yes, it causes a lot of tears. It's one of the biggest things that made the 1.8 to 1.9 transition difficult. ___ Rust-dev mailing list Rust-dev@mozilla.org https://mail.mozilla.org/listinfo/rust-dev

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Malthe Borch
On 1 May 2014 22:42, Tony Arcieri basc...@gmail.com wrote: No, when you combine strings with different encodings, you need to transcode one of the strings. When this happens, the transcoding process may encounter some characters which are valid in one encoding, but not another, in which case

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Patrick Walton
On 5/1/14 6:53 AM, Malthe Borch wrote: In Rust, the built-in std::str type is a sequence of unicode codepoints encoded as a stream of UTF-8 bytes. Meanwhile, building on experience with Python 2 and 3, I think it's worth considering a more flexible design. A string would be essentially a rope

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread John Downey
I have actually always been a fan of how .NET did this. The System.String type is opinionated in how it is stored internally and does not allow anyone to change that (unlike Ruby). The conversion from String to byte[] is done using explicit conversion methods like: -

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Tony Arcieri
On Thu, May 1, 2014 at 2:45 PM, Malthe Borch mbo...@gmail.com wrote: The transcoding needs to happen only at the time when you flatten the rope into a single encoding. And yes, it may then fail if you attempt to encode into a non-unicode encoding. This sounds like the exact same painful
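The failure mode debated here, where transcoding is deferred until the string is flattened, can be sketched as follows. This is an illustrative model only: `Segment`, `FlattenError`, and `flatten_to_latin1` are hypothetical names, a flat slice of segments stands in for the rope, and ISO 8859-1 stands in for any non-Unicode target encoding.

```rust
/// A piece of text tagged with its declared encoding (hypothetical model).
pub enum Segment {
    Utf8(String),
    Latin1(Vec<u8>),
}

/// Records which segment held the character that could not be encoded.
#[derive(Debug, PartialEq)]
pub struct FlattenError {
    pub segment: usize,
    pub character: char,
}

/// Flattens all segments into ISO 8859-1 bytes.
/// Latin-1 segments are copied as-is; UTF-8 segments are transcoded,
/// which fails for any code point above U+00FF.
pub fn flatten_to_latin1(segments: &[Segment]) -> Result<Vec<u8>, FlattenError> {
    let mut out = Vec::new();
    for (index, segment) in segments.iter().enumerate() {
        match segment {
            Segment::Latin1(bytes) => out.extend_from_slice(bytes),
            Segment::Utf8(text) => {
                for character in text.chars() {
                    if (character as u32) <= 0xFF {
                        out.push(character as u8);
                    } else {
                        return Err(FlattenError { segment: index, character });
                    }
                }
            }
        }
    }
    Ok(out)
}
```

Building the list of segments always succeeds; the error appears only when `flatten_to_latin1` runs, possibly far from the code that introduced the offending segment. The sketch returns the segment index to narrow that down, but nothing in the thread says the proposed design would do the same.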

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Thad Guidry
Agreed with Patrick. This proposal should not be in std::str ... it can live somewhere else ... but not there. -Thad On Thu, May 1, 2014 at 4:52 PM, Patrick Walton pcwal...@mozilla.com

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Nathan Myers
On 05/01/2014 02:52 PM, Patrick Walton wrote: On 5/1/14 6:53 AM, Malthe Borch wrote: In Rust, the built-in std::str type is a sequence of unicode codepoints encoded as a stream of UTF-8 bytes. ... A string would be essentially a rope where each leaf specifies an encoding, e.g. UTF-8 or

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Daniel Micay
On 01/05/14 07:49 PM, Nathan Myers wrote: On 05/01/2014 02:52 PM, Patrick Walton wrote: On 5/1/14 6:53 AM, Malthe Borch wrote: In Rust, the built-in std::str type is a sequence of unicode codepoints encoded as a stream of UTF-8 bytes. ... A string would be essentially a rope where each leaf

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Tony Arcieri
On Thu, May 1, 2014 at 4:49 PM, Nathan Myers n...@cantrip.org wrote: The history of programming languages is littered with mistakes around string types. There's no reason why Rust must repeat them all. FWIW, I've worked in systems that work the way you describe, and I disagree and think

Re: [rust-dev] UTF-8 strings versus encoded ropes

2014-05-01 Thread Nathan Myers
On 05/01/2014 04:57 PM, Daniel Micay wrote: On 01/05/14 07:49 PM, Nathan Myers wrote: In defining a library string we always grapple over how it should differ from a raw (variable or fixed) array of bytes. Ease of appending and of assigning into substrings always comes up. In the old days,