Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
On Dec 29, 2014, at 5:36 PM, Mike Cardwell wrote: So the system I've settled with is storing both the originally supplied representation, *and* the lower cased punycode encoded version in a separate column for indexing/search. This seems really hackish to me though. I actually do the same exact thing and don't think it's hackish. I actually really like being able to see the punycode next to the human representation on simple db pulls. It's barely more disk space and really useful. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
I'd like to store hostnames in a postgres database and I want to fully support IDNs (Internationalised Domain Names) I want to be able to recover the original representation of the hostname, so I can't just encode it with punycode and then store the ascii result. For example, these two are the same hostnames thanks to unicode case folding [1]: tesst.ëxämplé.com teßt.ëxämplé.com They both encode in punycode to the same thing: xn--tesst.xmpl.com-cib7f2a Don't believe me, then try visiting any domain with two s's in, whilst replacing the s's with ß's. E.g: ericßon.com nißan.com americanexpreß.com So if I pull out xn--tesst.xmpl.com-cib7f2a from the database, I've no idea which of those two hostnames was the original representation. The trouble is, if I store the unicode representation of a hostname instead, then when I run queries with conditions like: WHERE hostname='nißan.com' that wont pull out rows where hostname='nissan.com'. Also, if I create a unique index on the hostname field, or even on lower(hostname), that wont stop the same hostname being inserted more than once, with slightly different representations. So the system I've settled with is storing both the originally supplied representation, *and* the lower cased punycode encoded version in a separate column for indexing/search. This seems really hackish to me though. It seems to me that Postgres would benefit from a native hostname type and/or a pair of punycode encode/decode functions. And perhaps even a simple unicode case folding function. With the end result that these return TRUE: unicode_case_fold('ß') = 'ss' 'xn--tesst.xmpl.com-cib7f2a' = punycode_encode('teßt.ëxämplé.com') punycode_decode('xn--tesst.xmpl.com-cib7f2a') = 'tesst.ëxämplé.com' A native type would also be able to apply suitable constraints, e.g a maximum length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens on a label, etc. I initially created a pair of functions for using punycode using plperl and the IDNA::Punycode Perl module, but the performance wasn't good enough, so I settled with duplicating the data to sacrifice disk space and complexity, for speed. I'm new to Postgres, and to this list, so if there is a better way for me to submit this suggestion or solve my problem, please point me in the right direction. [1] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt Regards, -- Mike Cardwell https://grepular.com https://emailprivacytester.com OpenPGP Key35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F XMPP OTR Key 8924 B06A 7917 AAF3 DBB1 BF1B 295C 3C78 3EF1 46B4 signature.asc Description: Digital signature
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
On 12/29/2014 4:36 PM, Mike Cardwell wrote: I'd like to store hostnames in a postgres database and I want to fully support IDNs (Internationalised Domain Names) I want to be able to recover the original representation of the hostname, so I can't just encode it with punycode and then store the ascii result. For example, these two are the same hostnames thanks to unicode case folding [1]: tesst.ëxämplé.com teßt.ëxämplé.com They both encode in punycode to the same thing: xn--tesst.xmpl.com-cib7f2a Don't believe me, then try visiting any domain with two s's in, whilst replacing the s's with ß's. E.g: ericßon.com nißan.com americanexpreß.com So if I pull out xn--tesst.xmpl.com-cib7f2a from the database, I've no idea which of those two hostnames was the original representation. The trouble is, if I store the unicode representation of a hostname instead, then when I run queries with conditions like: WHERE hostname='nißan.com' _IF_ Postgres had a punycode function, then you could use: WHERE punycode(hostname) = punycode('nißan.com') -Andy -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
Andy Colson wrote On 12/29/2014 4:36 PM, Mike Cardwell wrote: I'd like to store hostnames in a postgres database and I want to fully support IDNs (Internationalised Domain Names) I want to be able to recover the original representation of the hostname, so I can't just encode it with punycode and then store the ascii result. For example, these two are the same hostnames thanks to unicode case folding [1]: tesst.ëxämplé.com teßt.ëxämplé.com They both encode in punycode to the same thing: xn--tesst.xmpl.com-cib7f2a Don't believe me, then try visiting any domain with two s's in, whilst replacing the s's with ß's. E.g: ericßon.com nißan.com americanexpreß.com So if I pull out xn--tesst.xmpl.com-cib7f2a from the database, I've no idea which of those two hostnames was the original representation. The trouble is, if I store the unicode representation of a hostname instead, then when I run queries with conditions like: WHERE hostname='nißan.com' _IF_ Postgres had a punycode function, then you could use: WHERE punycode(hostname) = punycode('nißan.com') If the OP wraps what he is doing up into a function that is what you end up getting: a memoized punycode function. http://en.wikipedia.org/wiki/Memoization It has to be defined as volatile but basically write the function to check for the provided input on the indexed table and if it doesn't exist the function will calculate the punycode value and store it onto the table before returning the punycode value to the caller. The question then becomes how to perform the actual punycode conversion. The API within PostgreSQL becomes that single function and you store only the original hostname in the user-data area. Whenever you need to access the punycode version you call the function. The only downside is that because the function is volatile you cannot create a functional index using it. It may be worth writing both an immutable conversion function and a memoization wrapper so you can index using the former while using the memoized version normally. I'm getting a little outside my experience level here as it seems there should be a better way to define a volatile function that is deterministic so that you get the best of both worlds. Just because a function modifies the database doesn't mean that its output value is not solely a function of its inputs - and that all side effects are of a logging or caching nature. David J. -- View this message in context: http://postgresql.nabble.com/Hostnames-IDNs-Punycode-and-Unicode-Case-Folding-tp5832350p5832360.html Sent from the PostgreSQL - general mailing list archive at Nabble.com. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
* on the Mon, Dec 29, 2014 at 04:23:22PM -0700, David G Johnston wrote: WHERE hostname='nißan.com' _IF_ Postgres had a punycode function, then you could use: WHERE punycode(hostname) = punycode('nißan.com') If the OP wraps what he is doing up into a function that is what you end up getting: a memoized punycode function. http://en.wikipedia.org/wiki/Memoization It has to be defined as volatile but basically write the function to check for the provided input on the indexed table and if it doesn't exist the function will calculate the punycode value and store it onto the table before returning the punycode value to the caller. I'm not sure all that is necessary. It could be quite a simple function, like the lower() function. So what I would do is this: CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname))); That would prevent adding more than one representation for the same hostname to the column. And I if I wanted to do a fast, indexed search where I could supply any representation of the hostname as input, I would just do: WHERE lower(punycode_encode(hostname)) = lower(punycode_encode('any-representation')) There doesn't need to be any extra table storage for the punycode encoded version. -- Mike Cardwell https://grepular.com https://emailprivacytester.com OpenPGP Key35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F XMPP OTR Key 8924 B06A 7917 AAF3 DBB1 BF1B 295C 3C78 3EF1 46B4 signature.asc Description: Digital signature
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
On Mon, Dec 29, 2014 at 11:50:54PM +, Mike Cardwell wrote: CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname))); This wouldn't work to get the original back if oyu have any IDNA2003 data, because puncode-encoding the UTF-8 under IDNA2003 and then punycode-decoding it doesn't always result in the same label. See my other message. Did I mention that IDNA is a mess? A -- Andrew Sullivan a...@crankycanuck.ca -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
On Mon, Dec 29, 2014 at 4:51 PM, Mike Cardwell [via PostgreSQL] ml-node+s1045698n583236...@n5.nabble.com wrote: * on the Mon, Dec 29, 2014 at 04:23:22PM -0700, David G Johnston wrote: WHERE hostname='nißan.com http://nissan.com' _IF_ Postgres had a punycode function, then you could use: WHERE punycode(hostname) = punycode('nißan.com http://nissan.com') If the OP wraps what he is doing up into a function that is what you end up getting: a memoized punycode function. http://en.wikipedia.org/wiki/Memoization It has to be defined as volatile but basically write the function to check for the provided input on the indexed table and if it doesn't exist the function will calculate the punycode value and store it onto the table before returning the punycode value to the caller. I'm not sure all that is necessary. It could be quite a simple function, like the lower() function. So what I would do is this: CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname))); That would prevent adding more than one representation for the same hostname to the column. Except two different hostname can resolve to the same punycode_encode(hostname) value so the unique index won't work. It was also mentioned that using the Perl encoding function was non-performant; which is why caching the data into a memoization table has value. WHERE lower(punycode_encode(hostname)) = lower(punycode_encode('any-representation')) I'm not for knowing the rules of punycode but I'm not seeing what value lower() provides here... There doesn't need to be any extra table storage for the punycode encoded version. David J. -- View this message in context: http://postgresql.nabble.com/Hostnames-IDNs-Punycode-and-Unicode-Case-Folding-tp5832350p5832368.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
David G Johnston david.g.johns...@gmail.com writes: It was also mentioned that using the Perl encoding function was non-performant; which is why caching the data into a memoization table has value. I find it hard to believe that the standards folk would have chosen a hostname encoding method that was so inefficient that memoization using a database table would outperform just recomputing the encoding. What I suspect the above really means is man, this Perl code sucks. regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
* on the Mon, Dec 29, 2014 at 07:00:05PM -0500, Andrew Sullivan wrote: CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname))); This wouldn't work to get the original back if oyu have any IDNA200 data, because puncode-encoding the UTF-8 under IDNA2003 and the punycode-decoding it doesn't always result in the same label. See my other message. The original is the thing that is stored in the database. I wouldn't need to do any conversion to get the original back. In my example I am storing the original and creating an index on the punycode version. This is exactly the same method that we commonly use for performing case insensitive text searches using lower() indexes. -- Mike Cardwell https://grepular.com https://emailprivacytester.com OpenPGP Key35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F XMPP OTR Key 8924 B06A 7917 AAF3 DBB1 BF1B 295C 3C78 3EF1 46B4 signature.asc Description: Digital signature
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
[resending, because somehow this got routed through my work address the first time.] Hi, I didn't have time to write a short note, so I wrote a long one instead. Sorry. On Mon, Dec 29, 2014 at 10:36:42PM +, Mike Cardwell wrote: can't just encode it with punycode and then store the ascii result. For example, these two are the same hostnames thanks to unicode case folding [1]: tesst.ëxämplé.com teßt.ëxämplé.com Well, in IDNA2003 they're the same. In IDNA2008 (RFC 5890 and suite), they're not the same. In UTS46, they're kind of the same, because pre-lookup processing maps one of them to the other (it depends which mode you're in which way the mapping goes, which is just fantastic because you can't tell at the server which mode the client is in. IDNA is an unholy mess); but the lookup is still done using the IDNA2008 rules, approximately. They both encode in punycode to the same thing: xn--tesst.xmpl.com-cib7f2a Under no circumstances should they encode to that. IDNA, either 2003 or 2008, is label by label, not for entire domain names. (The labels are the things between the dots in a presentation-format FQDN.) So, under IDNA2003, you should get tesst.xn--xmpl-loa7ai.com for both of them. Under IDNA2008, you should get that also for the first of them, and xn--tet-6ka.xn--xmpl-loa7ai.com for the second. Most clients, however, are probably going to want to do something UTS46ish, but there's no way to guarantee that the IDNA2003 and IDNA2008 labels go together (this is called variants, and you do not want to know how much horror that has caused). Worse, of course, you also don't know that what you have isn't just raw UTF-8 in the label, but let's boil one ocean at a time. Don't believe me, then try visiting any domain with two s's in, whilst replacing the s's with ß's. E.g: ericßon.com nißan.com americanexpreß.com This depends entirely on which version of IDNA you're using. Many browsers right now officially do IDNA2003. Unfortunately, they don't _actually_ do that either because IDNA2003 is nominally nailed to Unicode version 3.2, and there's approximately zero chance that a running computer on the Internet these days is using such an old Unicode version. IDNA2008, by the way, wasn't something we did (I am one of the people you can blame for this) for fun. The very problem you're noting in IDNA2003 is one of the things we were trying to fix. Under IDNA2003, the Unicode-Punycode-Unicode round trip could lose data. Under IDNA2008, this is fixed: every A-label (the xn--Punycodehere version) corresponds to exactly one U-label (the Unicode representation) and conversely. (It follows from this that lots of Unicode strings aren't U-labels, because there are a lot of rules about what can be a U-label. For instance, capital letters aren't allowed, because they're not stable under caseFold. This is all in the IDNA2008 RFCs, but they're not an easy read. We tried.) So if I pull out xn--tesst.xmpl.com-cib7f2a from the database, I've no idea which of those two hostnames was the original representation. None. What you need to do is split the name on label boundaries (which is hard, because believe it or not . is a valid character in the DNS, but splitting on the . character is probably as good as you can do here. But look for escaped ones). Then you can check for validity under IDNA2008 and IDNA2003, and then you can run it through Punycode. ICANN, for whatever it's worth, is using IDNA2008 rules for top-level domains and as part of its IDNA guidelines, so over time the actually _registered_ names are going to be either IDNA2008 or, at worst, UTS46. The trouble is, if I store the unicode representation of a hostname instead, then when I run queries with conditions like: WHERE hostname='nißan.com' that wont pull out rows where hostname='nissan.com'. Right. If I were doing this, I think I'd probably create two functions, one to do IDNA2003 and one to do IDNA2008. Then I'd put a functional index on it for both cases. Eventually, you'll be able to drop the IDNA2003 lookup because everything will conform to IDNA2003. (Note that the WHATWG, which the W3C is going to listen to but which is not a W3C WG, appears to be trying to undo that; but IDNA2003 is irredeemably broken. So there is a mess here brewing.) So the system I've settled with is storing both the originally supplied representation, *and* the lower cased punycode encoded version in a separate column for indexing/search. This seems really hackish to me though. Well, see above. The other way I'd do it is to store _both_ the IDNA2003 punycode and also the IDNA2008 A-label. The reason I hate this is because the lookup is insanely complicated. It seems to me that Postgres would benefit from a native hostname type and/or a pair of punycode encode/decode functions. A pair won't work. And perhaps even a simple unicode case folding function. Unicode case folding is _way_ more
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
On Tue, Dec 30, 2014 at 12:18:58AM +, Mike Cardwell wrote: This is exactly the same method that we commonly use for performing case insensitive text searches using lower() indexes. Hmm. How did you get the original, then? If you have the original Unicode version, why don't you switch to IDNA2008 publication rules, which are way more reliable? In that case, you do have a 1:1 lookup and you shouldn't have a problem. If you need variants, then you have a different problem, but that actually can be specified for the much narrower range of UTF-8 permissible under IDNA2008. A -- Andrew Sullivan a...@crankycanuck.ca -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
* on the Mon, Dec 29, 2014 at 05:01:44PM -0700, David G Johnston wrote: CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname))); That would prevent adding more than one representation for the same hostname to the column. Except two different hostname can resolve to the same punycode_encode(hostname) value That's exactly what I'm taking advantage of... so the unique index won't work. I think you misunderstand what I'm suggesing. I'm suggesting storing the unicode version in the database, and then creating a unique index on the result of the punycode function to prevent more than one representation of the same hostname being inserted. It was also mentioned that using the Perl encoding function was non-performant; which is why caching the data into a memoization table has value. It was non-performant because it involved unnecessarily starting a Perl interpreter. Not because the algorithm its self is slow. WHERE lower(punycode_encode(hostname)) = lower(punycode_encode('any-representation')) I'm not for knowing the rules of punycode but I'm not seeing what value lower() provides here... Case insensitive matching. So that EXAMPLE.COM = example.com -- Mike Cardwell https://grepular.com https://emailprivacytester.com OpenPGP Key35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F XMPP OTR Key 8924 B06A 7917 AAF3 DBB1 BF1B 295C 3C78 3EF1 46B4 signature.asc Description: Digital signature
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
* on the Mon, Dec 29, 2014 at 07:22:21PM -0500, Andrew Sullivan wrote: can't just encode it with punycode and then store the ascii result. For example, these two are the same hostnames thanks to unicode case folding [1]: tesst.ëxämplé.com teßt.ëxämplé.com Well, in IDNA2003 they're the same. In IDNA2008 (RFC 5890 and suite), they're not the same. In UTS46, they're kind of the same, because pre-lookup processing maps one of them to the other (it depends which mode you're in which way the mapping goes, which is just fantastic because you can't tell at the server which mode the client is in. IDNA is an unholy mess); but the lookup is still done using the IDNA2008 rules, approximately. Heh. And I just thought I was finally starting to get to grips with this stuff. They both encode in punycode to the same thing: xn--tesst.xmpl.com-cib7f2a Under no circumstances should they encode to that. Eurgh, you're right. The library I'm using does actually do it right, I just forgot to split on the dot and encode each label separately when writing the examples for this email. Sorry for confusing matters. [snip lots of useful and interesting information] You seem to want a bunch of label constraints, not all of which are related to IDNA. I think it would be better to break these up into a small number of functions. As it happens, I have a colleague at Dyn who I think has some need of some of this too, and so it might be worth spinning up a small project to try to get generic functions: to_idna2003, to_idna2008, check_ldh, split_labels, and so on. If this seems possibly interesting for collaboration, let me know I'll try to put together the relevant people. Those functions would be very useful to me. I know a bit of C, but probably not enough to produce an acceptable patch. If there are people who would also find these functions useful, and people motivated to implement them, that would be great... -- Mike Cardwell https://grepular.com https://emailprivacytester.com OpenPGP Key35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F XMPP OTR Key 8924 B06A 7917 AAF3 DBB1 BF1B 295C 3C78 3EF1 46B4 signature.asc Description: Digital signature
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
* on the Mon, Dec 29, 2014 at 07:25:59PM -0500, Andrew Sullivan wrote: This is exactly the same method that we commonly use for performing case insensitive text searches using lower() indexes. Hmm. How did you get the original, then? The original in my case, is the hostname which the end user supplied. Essentially, when I display it back to them, I want to make sure it is displayed the same way that it was when they originally submitted it. If you have the original Unicode version, why don't you switch to IDNA2008 publication rules, which are way more reliable? In that case, you do have a 1:1 lookup and you shouldn't have a problem. I was unaware of the different versions of IDNA. I basically started using the Perl module IDNA::Punycode in my project and assumed that this was the only type. Seems like I need to do some more reading. -- Mike Cardwell https://grepular.com https://emailprivacytester.com OpenPGP Key35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F XMPP OTR Key 8924 B06A 7917 AAF3 DBB1 BF1B 295C 3C78 3EF1 46B4 signature.asc Description: Digital signature
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
On Tue, Dec 30, 2014 at 12:53:42AM +, Mike Cardwell wrote: Hmm. How did you get the original, then? The original in my case, is the hostname which the end user supplied. Essentially, when I display it back to them, I want to make sure it is displayed the same way that it was when they originally submitted it. Ah. This gets even better™ for you, then, because whereas in IDNA2003 you can pass it an old fashioned LDH name (letter, digit, hypen), IDNA2008 treats those as _outside_ the spec. So basically, you first have to get a label and determine whether it is LDH or not (you can do this by checking for any octet outside the LDH range) and then you can decide which way to process it. In IDNA2003, the punycode output from an LDH label turns out always to be the LDH label. The reason for this is that you're supposed to validate that a U-label is really a U-label before registering in IDNA2008, and lots of perfectly good LDH labels (like EXAMPLE) are not valid under IDNA2008 because of upper case. (If by now you think that maybe it's time for this DNS thing to get replaced, you have company.) I was unaware of the different versions of IDNA. I basically started using the Perl module IDNA::Punycode in my project and assumed that this was the only type. Seems like I need to do some more reading. Yeah, this is all made much harder by the fact that several IDN libraries still do 2003. Here is one that many people are using for IDNA2008: https://gitorious.org/libidn2/libidn2/source/0d6b5c0a9f1e4a9742c5ce32b6241afb4910cae1: It's GPLv3, though, which brings its own issues. A -- Andrew Sullivan a...@crankycanuck.ca -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general