Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
Reminder to please vote for a time for this if you'd still like to attend!

isis agora lovecruft transcribed 2.2K bytes:
> Hello,
>
> Let's schedule a proposal discussion for prop#285 "Directory documents
> should be standardized as UTF-8" [0] sometime between 12 - 13 Feb. If
> you're CCed, it's because you put your name down on the pad as being
> interested in this discussion. If anyone has requests or concerns, or if I
> forgot to take your timezone into account, please let me know.
>
> https://doodle.com/poll/cnc6scybbfpky5f8
>
> [0]: https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt

Best regards,
--
 ♥Ⓐ isis agora lovecruft
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://fyb.patternsinthevoid.net/isis.txt

___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
This one is in #tor-meeting, next Monday, 12 February from 21:00-22:00 UTC.

In local times:

 * Monday, 12 February 13:00-14:00 PST
 * Monday, 12 February 16:00-17:00 EST
 * Monday, 12 February 22:00-23:00 CET
 * Tuesday, 13 February 08:00-09:00 AEDT

isis agora lovecruft transcribed 2.3K bytes:
> Reminder to please vote for a time for this if you'd still like to attend!
>
> isis agora lovecruft transcribed 2.2K bytes:
> > Hello,
> >
> > Let's schedule a proposal discussion for prop#285 "Directory documents
> > should be standardized as UTF-8" [0] sometime between 12 - 13 Feb. If
> > you're CCed, it's because you put your name down on the pad as being
> > interested in this discussion. If anyone has requests or concerns, or if I
> > forgot to take your timezone into account, please let me know.
> >
> > https://doodle.com/poll/cnc6scybbfpky5f8
> >
> > [0]: https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt

--
 ♥Ⓐ isis agora lovecruft
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://fyb.patternsinthevoid.net/isis.txt
Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
Hi!

The notes from this meeting are online. [0]

Thanks to everyone who attended! Extra thanks to teor for conducting the
meeting since I was stupidly 8 minutes late due to impatiently watching a
kettle boil after eating very spicy cioppino and then *extremely* needing a
glass of iced tea immediately.

We found some issues w.r.t. the specifics of the proposal, but overall we've
agreed that it should be accepted (roughly, after some minor revision) in
its current state. As such, it is looking for someone interested in
implementing it! (THIS COULD BE YOU)

A couple outcomes of this:

 1. What passes for "canonicalised" "utf-8" in C will be different to
    what passes for "canonicalised" "utf-8" in Rust. In C, the
    following will not be allowed (whereas they are allowed in Rust):
    - NUL (0x00)
    - Byte Order Mark (0xFEFF)

 2. Directory document keywords MUST be printable ASCII.

 3. This change may break some descriptor/consensus/document parsers.
    If you are the maintainer of a parser, you may want to start
    thinking about this now.

[0]: http://meetbot.debian.net/tor-meeting/2018/tor-meeting.2018-02-12-21.04.html

Best regards,
--
 ♥Ⓐ isis agora lovecruft
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://fyb.patternsinthevoid.net/isis.txt
Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
> On 13 Feb 2018, at 10:55, isis agora lovecruft wrote:
>
> A couple outcomes of this:
>
> 1. What passes for "canonicalised" "utf-8" in C will be different to
>    what passes for "canonicalised" "utf-8" in Rust. In C, the
>    following will not be allowed (whereas they are allowed in Rust):
>    - NUL (0x00)
>    - Byte Order Mark (0xFEFF)

I want to clarify this point:

The Byte Order Mark is the Unicode scalar value U+FEFF, encoded in UTF-8
as the bytes 0xEF 0xBB 0xBF.

Tor's C and Rust implementations of UTF-8 must be identical.

When we write the C implementation, we must reject NUL for compatibility
with C string functions.

When we write the Rust implementation, we must reject NUL for
compatibility with the C implementation. (Rust already implements UTF-8
strings that accept NUL, so this will require custom code.)

When we write the C and Rust implementations, we must reject BOM because
it's unnecessary. Rejecting BOM is recommended by the relevant standard.
(Rust already implements UTF-8 strings that accept BOM, so this will
require custom code.)

T
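To make the rules in the message above concrete, here is a minimal Python sketch. It is not Tor's C or Rust code, and the function name `validate_directory_document` is hypothetical. It decodes the bytes strictly as UTF-8, then rejects NUL and U+FEFF. Whether U+FEFF should be rejected only at the start of a document or anywhere in it is not settled in this thread; the sketch assumes "anywhere".

```python
BOM = "\ufeff"  # U+FEFF; encoded in UTF-8 as the bytes EF BB BF


def validate_directory_document(raw: bytes) -> str:
    """Return the decoded text, or raise ValueError if the document is rejected.

    Hypothetical helper for illustration only.
    """
    try:
        # The plain "utf-8" codec is strict by default and does not strip a BOM.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8: {exc}") from exc
    if "\x00" in text:
        raise ValueError("document contains NUL")
    if BOM in text:
        # Assumption: U+FEFF is rejected anywhere, not only at the start.
        raise ValueError("document contains U+FEFF (byte order mark)")
    return text
```

Doing the NUL and BOM checks after decoding keeps them independent of the decoder, which is the "custom code" a language with permissive native strings would need on top of its built-in UTF-8 support.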
Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
> 1. What passes for "canonicalised" "utf-8" in C will be different to
>    what passes for "canonicalised" "utf-8" in Rust. In C, the
>    following will not be allowed (whereas they are allowed in Rust):
>    - NUL (0x00)
>    - Byte Order Mark (0xFEFF)

Much of the metrics software is written in Java. Java strings allow for
NUL to appear, but assume that there is no BOM. If a BOM appears, then
this would be interpreted as data and, I assume, parsing would probably
fail. Should the whole document be rejected if it contains a NUL or BOM,
or should these values be stripped and then carry on parsing as if it
never happened?

> 2. Directory document keywords MUST be printable ASCII.

This can be validated. Should a single document keyword containing
printable non-ASCII be enough to reject the document, or should a parser
try to recover?

I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected, otherwise all the
parsers may end up doing different things.

> 3. This change may break some descriptor/consensus/document parsers.
>    If you are the maintainer of a parser, you may want to start
>    thinking about this now.

For the metrics tools there are some guidelines on this we can follow:
https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
language would be Python (for stem), but Python developers have probably
got a good understanding of unicode/str/bytes by now. (In Python 3: when
using UTF-8, BOM will not be stripped and will be interpreted as data,
and you can have a NUL in a str).

Thanks,
Iain.
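The Python 3 behaviour described in the parenthetical above can be checked with a few lines. This is only a demonstration of the standard codecs, with made-up sample bytes; it says nothing about how stem handles these cases.

```python
# Sample bytes (invented for illustration): a UTF-8 BOM, some text, and a NUL.
raw = b"\xef\xbb\xbfrouter example\x00\n"

decoded = raw.decode("utf-8")          # plain "utf-8" keeps the BOM as U+FEFF
decoded_sig = raw.decode("utf-8-sig")  # "utf-8-sig" strips one leading BOM

print(decoded.startswith("\ufeff"))      # BOM survives as data
print("\x00" in decoded)                 # NUL is an ordinary character in a str
print(decoded_sig.startswith("router"))  # BOM silently removed by this codec
```

Because `utf-8-sig` strips the mark silently, a parser that wants to reject documents containing a BOM would have to decode with plain `utf-8` and test for U+FEFF itself.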
Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
> On 13 Feb 2018, at 21:55, Iain Learmonth wrote:
>
> Hi,
>
>> On 12/02/18 23:55, isis agora lovecruft wrote:
>> 1. What passes for "canonicalised" "utf-8" in C will be different to
>>    what passes for "canonicalised" "utf-8" in Rust. In C, the
>>    following will not be allowed (whereas they are allowed in Rust):
>>    - NUL (0x00)
>>    - Byte Order Mark (0xFEFF)
>
> Much of the metrics software is written in Java. Java strings allow for
> NUL to appear, but assume that there is no BOM. If a BOM appears, then
> this would be interpreted as data and, I assume, parsing would probably
> fail. Should the whole document be rejected if it contains a NUL or BOM,
> or should these values be stripped and then carry on parsing as if it
> never happened?

Directory authorities and bridge clients already reject descriptors that
contain NUL. (This is an artefact of the C implementation: the descriptor
is seen as truncated, so it won't parse.)

We should specify rejection for BOM as well.

>> 2. Directory document keywords MUST be printable ASCII.
>
> This can be validated. Should a single document keyword containing
> printable non-ASCII be enough to reject the document, or should a parser
> try to recover?

If parsers want to be consistent with the Tor implementation, they should
reject.

> I'd really like to see a section in the proposal about how parsers
> should react when they find something unexpected, otherwise all the
> parsers may end up doing different things.

+1

>> 3. This change may break some descriptor/consensus/document parsers.
>>    If you are the maintainer of a parser, you may want to start
>>    thinking about this now.
>
> For the metrics tools there are some guidelines on this we can follow:
> https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
> language would be Python (for stem), but Python developers have probably
> got a good understanding of unicode/str/bytes by now. (In Python 3: when
> using UTF-8, BOM will not be stripped and will be interpreted as data,
> and you can have a NUL in a str).

Python for txtorcon
Rust for Tor's experimental protover implementation

And perhaps others:
https://stem.torproject.org/faq.html#are-there-any-other-controller-libraries
https://trac.torproject.org/projects/tor/wiki/doc/ListOfTorImplementations

T
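As a rough illustration of "reject rather than recover" for keywords, here is a Python sketch. It assumes, as a simplification, that the keyword is the first space- or tab-delimited token on each non-empty line, and it reads "printable ASCII" as the range 0x21 to 0x7E. The function names are hypothetical, and this is not how Tor, stem, or the metrics tools actually parse documents.

```python
import re


def is_printable_ascii_token(token: str) -> bool:
    """True if token is non-empty and every character is in 0x21..0x7E."""
    return bool(token) and all(0x21 <= ord(ch) <= 0x7E for ch in token)


def check_keywords(document: str) -> None:
    """Raise ValueError on the first line whose keyword is not printable ASCII.

    Assumption: the keyword is everything before the first space or tab.
    """
    for lineno, line in enumerate(document.split("\n"), start=1):
        if not line:
            continue  # blank lines, including the one after a trailing newline
        keyword = re.split(r"[ \t]", line, maxsplit=1)[0]
        if not is_printable_ascii_token(keyword):
            # Reject the whole document; no attempt to skip the line and go on.
            raise ValueError(f"line {lineno}: keyword is not printable ASCII")
```

Non-ASCII text in the arguments of a line (a contact name, say) passes this check; only the keyword itself is constrained.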
Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
> For the metrics tools there are some guidelines on this we can follow:
> https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
> language would be Python (for stem), but Python developers have probably
> got a good understanding of unicode/str/bytes by now. (In Python 3: when
> using UTF-8, BOM will not be stripped and will be interpreted as data,
> and you can have a NUL in a str).

Hi Iain. Actually, for Stem I'm really looking forward to this too.
Stem has special handling for the contact and platform fields (iirc
the only spot non-ascii content can presently appear). Stem's parsers
and API will be simplified once everything is uniformly utf-8. :P

Possibly a stupid question but any reason not to require the whole
descriptor document to be printable characters?
Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
> On 14 Feb 2018, at 11:03, Damian Johnson wrote:
>
>> For the metrics tools there are some guidelines on this we can follow:
>> https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
>> language would be Python (for stem), but Python developers have probably
>> got a good understanding of unicode/str/bytes by now. (In Python 3: when
>> using UTF-8, BOM will not be stripped and will be interpreted as data,
>> and you can have a NUL in a str).
>
> Hi Iain. Actually, for Stem I'm really looking forward to this too.
> Stem has special handling for the contact and platform fields (iirc
> the only spot non-ascii content can presently appear). Stem's parsers
> and API will be simplified once everything is uniformly utf-8. :P
>
> Possibly a stupid question but any reason not to require the whole
> descriptor document to be printable characters?

Requiring printable ASCII throughout the document means that people can't
spell their names and email addresses correctly in contact lines.

Requiring printable unicode introduces a dependency on a particular unicode
version, because we don't know if unallocated blocks will be printable or
not.

I think we could make platform lines printable ASCII without losing much.
Unless there are platforms that have non-ASCII names?

T

--
Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
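The version dependency mentioned above is easy to see from Python, whose notion of "printable" comes from whichever Unicode database the interpreter ships with. The sketch below is illustrative only: the helper names are hypothetical, and U+0378 is used as an example of a code point that is unassigned at the time of writing, which is exactly the kind of answer that can change in a later Unicode version. A printable-ASCII check, by contrast, has no such dependency.

```python
import unicodedata


def describe(ch: str) -> str:
    """Report the category and printability of one character, as seen by
    this interpreter's Unicode database."""
    return (f"U+{ord(ch):04X} category={unicodedata.category(ch)} "
            f"printable={ch.isprintable()}")


def is_printable_ascii(text: str) -> bool:
    """Version-independent check: ASCII only, no control characters."""
    return text.isascii() and text.isprintable()


print("Unicode database version:", unicodedata.unidata_version)
print(describe("\u00e9"))  # an assigned letter
print(describe("\u0378"))  # unassigned at the time of writing; may change

print(is_printable_ascii("Tor 0.3.2.9 on Linux"))
```

Two parsers built against different Unicode versions could therefore disagree about the same document under a "printable Unicode" rule, while they cannot disagree under a "printable ASCII" rule for platform lines.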
[tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8" (was: Nominate/vote for future proposal discussion meetings!)
Hello,

Let's schedule a proposal discussion for prop#285 "Directory documents
should be standardized as UTF-8" [0] sometime between 12 - 13 Feb. If
you're CCed, it's because you put your name down on the pad as being
interested in this discussion. If anyone has requests or concerns, or if I
forgot to take your timezone into account, please let me know.

https://doodle.com/poll/cnc6scybbfpky5f8

[0]: https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt

Best regards,
--
 ♥Ⓐ isis agora lovecruft
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://fyb.patternsinthevoid.net/isis.txt