#21932: Stop relying on the platform's default charset
-------------------------------------+--------------------------
 Reporter:  karsten                  |          Owner:  metrics-team
     Type:  defect                   |         Status:  new
 Priority:  Medium                   |      Milestone:
Component:  Metrics/metrics-lib      |        Version:
 Severity:  Normal                   |       Keywords:
Actual Points:                       |      Parent ID:
   Points:                           |       Reviewer:
  Sponsor:                           |
-------------------------------------+--------------------------

While looking into the encoding issue of different Onionoo instances producing different contact string encodings (#15813), I tracked down this issue to metrics-lib's `ServerDescriptorImpl.java` class and its usage of `new String(byte[])`.

The issue is that the constructor above uses "the platform's default charset". Turns out that the main Onionoo instance uses `US-ASCII` as default charset (`Charset.defaultCharset()`) and the mirror uses `UTF-8`.

(Interestingly, the mirror only uses `UTF-8` for commands executed by cron and also uses `US-ASCII` for commands directly executed by my user, so the default would change depending on whether Onionoo's updater was started automatically after a reboot or started manually by the user; which made debugging just a bit more challenging!)

Long story short, we should not rely on the platform's default charset when converting bytes to strings or vice versa, but we should explicitly specify the charset we want! We just need to pick one.

Somewhat related, I ran an analysis of character encodings in relay server descriptors two weeks ago. Here's what I found:

{{{
$ wget https://collector.torproject.org/archive/relay-descriptors/server-descriptors/server-descriptors-2017-02.tar.xz
$ tar xf server-descriptors-2017-02.tar.xz
$ find server-descriptors-2017-02 -type f -exec file --mime {} \; > mimes
$ cut -d" " -f3 mimes | sort | uniq -c
     68 charset=iso-8859-1
 466900 charset=us-ascii
   1145 charset=utf-8
}}}

I'd say let's just pretend that server descriptors are UTF-8 encoded.
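
To make the failure mode concrete, here is a minimal sketch of how the same descriptor bytes decode differently under `US-ASCII` and `UTF-8`. This is not metrics-lib code: the class `CharsetDemo`, the helper `decode`, and the sample contact line are made up for illustration. It assumes only the standard `String(byte[], Charset)` constructor and `StandardCharsets` constants.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {

  /** Decodes raw bytes with an explicitly chosen charset (hypothetical helper). */
  static String decode(byte[] raw, Charset charset) {
    return new String(raw, charset);
  }

  public static void main(String[] args) {
    // Hypothetical contact line with one non-ASCII character.
    // \u00e9 is written as an escape so the source file encoding does not matter.
    byte[] raw = "contact Andr\u00e9".getBytes(StandardCharsets.UTF_8);

    // What new String(byte[]) would silently pick on this machine.
    System.out.println("platform default: " + Charset.defaultCharset());

    // Explicit charsets give the same result on every host.
    String asUtf8 = decode(raw, StandardCharsets.UTF_8);
    String asAscii = decode(raw, StandardCharsets.US_ASCII);

    // US-ASCII cannot represent bytes >= 0x80; the String constructor
    // substitutes the replacement character U+FFFD for them.
    System.out.println("decoded as UTF-8:    " + asUtf8);
    System.out.println("decoded as US-ASCII: " + asAscii);
    System.out.println("same string? " + asUtf8.equals(asAscii));
  }
}
```

With a default of `US-ASCII` the contact string ends up with replacement characters, and with a default of `UTF-8` it round-trips, which would match the difference seen between the two Onionoo instances.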

Assuming UTF-8, the following patch will resolve the issue for server descriptors:

{{{
diff --git a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
index 309cad4..2381378 100644
--- a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
@@ -8,6 +8,7 @@ import org.torproject.descriptor.DescriptorParseException;
 import org.torproject.descriptor.ServerDescriptor;
 
 import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.util.ArrayList;
@@ -56,8 +57,8 @@ public abstract class ServerDescriptorImpl extends DescriptorImpl
   }
 
   private void parseDescriptorBytes() throws DescriptorParseException {
-    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes))
-        .useDelimiter("\n");
+    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes,
+        StandardCharsets.UTF_8)).useDelimiter("\n");
     String nextCrypto = "";
     List<String> cryptoLines = null;
     while (scanner.hasNext()) {
}}}

If this sounds like a reasonable plan, we should look into other places in the code where we use methods relying on the platform's default charset and explicitly specify a charset there, too.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/21932>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online