subject:"\[GENERAL\] Hostnames, IDNs, Punycode and Unicode Case Folding"

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2015-01-05 Thread Jonathan Vanasco


On Dec 29, 2014, at 5:36 PM, Mike Cardwell wrote:

 So the system I've settled with is storing both the originally supplied
 representation, *and* the lower cased punycode encoded version in a separate
 column for indexing/search. This seems really hackish to me though.

I actually do the same exact thing and don't think it's hackish.  I actually 
really like being able to see the punycode next to the human representation on 
simple db pulls.  It's barely more disk space and really useful.  



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell

I'd like to store hostnames in a postgres database and I want to fully support
IDNs (Internationalised Domain Names)

I want to be able to recover the original representation of the hostname, so I
can't just encode it with punycode and then store the ascii result. For example,
these two are the same hostnames thanks to unicode case folding [1]:

  tesst.ëxämplé.com
  teßt.ëxämplé.com

They both encode in punycode to the same thing:

  xn--tesst.xmpl.com-cib7f2a

Don't believe me, then try visiting any domain with two s's in, whilst replacing
the s's with ß's. E.g:

  ericßon.com
  nißan.com
  americanexpreß.com

So if I pull out xn--tesst.xmpl.com-cib7f2a from the database, I've no idea
which of those two hostnames was the original representation.

The trouble is, if I store the unicode representation of a hostname instead,
then when I run queries with conditions like:

  WHERE hostname='nißan.com'

that wont pull out rows where hostname='nissan.com'. Also, if I create a unique
index on the hostname field, or even on lower(hostname), that wont stop the
same hostname being inserted more than once, with slightly different
representations.

So the system I've settled with is storing both the originally supplied
representation, *and* the lower cased punycode encoded version in a separate
column for indexing/search. This seems really hackish to me though.

It seems to me that Postgres would benefit from a native hostname type and/or
a pair of punycode encode/decode functions. And perhaps even a simple unicode
case folding function. With the end result that these return TRUE:

  unicode_case_fold('ß') = 'ss'

  'xn--tesst.xmpl.com-cib7f2a' = punycode_encode('teßt.ëxämplé.com')

  punycode_decode('xn--tesst.xmpl.com-cib7f2a') = 'tesst.ëxämplé.com'

A native type would also be able to apply suitable constraints, e.g a maximum
length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a
limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens
on a label, etc.

I initially created a pair of functions for using punycode using plperl and the
IDNA::Punycode Perl module, but the performance wasn't good enough, so I settled
with duplicating the data to sacrifice disk space and complexity, for speed.

I'm new to Postgres, and to this list, so if there is a better way for me to
submit this suggestion or solve my problem, please point me in the right
direction.

[1] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

Regards,

-- 
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4


signature.asc
Description: Digital signature

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andy Colson


On 12/29/2014 4:36 PM, Mike Cardwell wrote:

I'd like to store hostnames in a postgres database and I want to fully support
IDNs (Internationalised Domain Names)

I want to be able to recover the original representation of the hostname, so I
can't just encode it with punycode and then store the ascii result. For example,
these two are the same hostnames thanks to unicode case folding [1]:

   tesst.ëxämplé.com
   teßt.ëxämplé.com

They both encode in punycode to the same thing:

   xn--tesst.xmpl.com-cib7f2a

Don't believe me, then try visiting any domain with two s's in, whilst replacing
the s's with ß's. E.g:

   ericßon.com
   nißan.com
   americanexpreß.com

So if I pull out xn--tesst.xmpl.com-cib7f2a from the database, I've no idea
which of those two hostnames was the original representation.

The trouble is, if I store the unicode representation of a hostname instead,
then when I run queries with conditions like:

   WHERE hostname='nißan.com'



_IF_ Postgres had a punycode function, then you could use:
WHERE punycode(hostname) = punycode('nißan.com')


-Andy



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread David G Johnston

Andy Colson wrote
 On 12/29/2014 4:36 PM, Mike Cardwell wrote:
 I'd like to store hostnames in a postgres database and I want to fully
 support
 IDNs (Internationalised Domain Names)

 I want to be able to recover the original representation of the hostname,
 so I
 can't just encode it with punycode and then store the ascii result. For
 example,
 these two are the same hostnames thanks to unicode case folding [1]:

tesst.ëxämplé.com
teßt.ëxämplé.com

 They both encode in punycode to the same thing:

xn--tesst.xmpl.com-cib7f2a

 Don't believe me, then try visiting any domain with two s's in, whilst
 replacing
 the s's with ß's. E.g:

ericßon.com
nißan.com
americanexpreß.com

 So if I pull out xn--tesst.xmpl.com-cib7f2a from the database, I've no
 idea
 which of those two hostnames was the original representation.

 The trouble is, if I store the unicode representation of a hostname
 instead,
 then when I run queries with conditions like:

WHERE hostname='nißan.com'

 
 _IF_ Postgres had a punycode function, then you could use:
 WHERE punycode(hostname) = punycode('nißan.com')

If the OP wraps what he is doing up into a function that is what you end up
getting: a memoized punycode function.

http://en.wikipedia.org/wiki/Memoization

It has to be defined as volatile but basically write the function to check
for the provided input on the indexed table and if it doesn't exist the
function will calculate the punycode value and store it onto the table
before returning the punycode value to the caller.

The question then becomes how to perform the actual punycode conversion.

The API within PostgreSQL becomes that single function and you store only
the original hostname in the user-data area.  Whenever you need to access
the punycode version you call the function.

The only downside is that because the function is volatile you cannot create
a functional index using it.  It may be worth writing both an immutable
conversion function and a memoization wrapper so you can index using the
former while using the memoized version normally.  

I'm getting a little outside my experience level here as it seems there
should be a better way to define a volatile function that is deterministic
so that you get the best of both worlds.  Just because a function modifies
the database doesn't mean that its output value is not solely a function of
its inputs - and that all side effects are of a logging or caching nature.

David J.



--
View this message in context: 
http://postgresql.nabble.com/Hostnames-IDNs-Punycode-and-Unicode-Case-Folding-tp5832350p5832360.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell

* on the Mon, Dec 29, 2014 at 04:23:22PM -0700, David G Johnston wrote:

WHERE hostname='nißan.com'

 
 _IF_ Postgres had a punycode function, then you could use:
 WHERE punycode(hostname) = punycode('nißan.com')
 
 If the OP wraps what he is doing up into a function that is what you end up
 getting: a memoized punycode function.
 
 http://en.wikipedia.org/wiki/Memoization
 
 It has to be defined as volatile but basically write the function to check
 for the provided input on the indexed table and if it doesn't exist the
 function will calculate the punycode value and store it onto the table
 before returning the punycode value to the caller.

I'm not sure all that is necessary. It could be quite a simple function,
like the lower() function. So what I would do is this:

CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames 
(lower(punycode_encode(hostname)));

That would prevent adding more than one representation for the same hostname
to the column. And I if I wanted to do a fast, indexed search where I could
supply any representation of the hostname as input, I would just do:

WHERE lower(punycode_encode(hostname)) = 
lower(punycode_encode('any-representation'))

There doesn't need to be any extra table storage for the punycode encoded
version.

-- 
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4


signature.asc
Description: Digital signature

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andrew Sullivan

On Mon, Dec 29, 2014 at 11:50:54PM +, Mike Cardwell wrote:
 
 CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames 
 (lower(punycode_encode(hostname)));

This wouldn't work to get the original back if oyu have any IDNA2003
data, because puncode-encoding the UTF-8 under IDNA2003 and then
punycode-decoding it doesn't always result in the same label.  See my
other message.

Did I mention that IDNA is a mess?

A

-- 
Andrew Sullivan
a...@crankycanuck.ca


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread David G Johnston

On Mon, Dec 29, 2014 at 4:51 PM, Mike Cardwell [via PostgreSQL] 
ml-node+s1045698n583236...@n5.nabble.com wrote:

 * on the Mon, Dec 29, 2014 at 04:23:22PM -0700, David G Johnston wrote:

 WHERE hostname='nißan.com http://nissan.com'
 
 
  _IF_ Postgres had a punycode function, then you could use:
  WHERE punycode(hostname) = punycode('nißan.com http://nissan.com')
 
  If the OP wraps what he is doing up into a function that is what you end
 up
  getting: a memoized punycode function.
 
  http://en.wikipedia.org/wiki/Memoization
 
  It has to be defined as volatile but basically write the function to
 check
  for the provided input on the indexed table and if it doesn't exist the
  function will calculate the punycode value and store it onto the table
  before returning the punycode value to the caller.
 I'm not sure all that is necessary. It could be quite a simple function,
 like the lower() function. So what I would do is this:

 CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames
 (lower(punycode_encode(hostname)));

 That would prevent adding more than one representation for the same
 hostname
 to the column.


Except two different hostname can resolve to the same
punycode_encode(hostname) value so the unique index won't work.
It was also mentioned that using the Perl encoding function was
non-performant; which is why caching the data into a memoization table has
value.


WHERE lower(punycode_encode(hostname)) =
 lower(punycode_encode('any-representation'))


I'm not for knowing the rules of punycode but I'm not seeing what value
lower() provides here...




 There doesn't need to be any extra table storage for the punycode encoded
 version.


David J.




--
View this message in context: 
http://postgresql.nabble.com/Hostnames-IDNs-Punycode-and-Unicode-Case-Folding-tp5832350p5832368.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Tom Lane

David G Johnston david.g.johns...@gmail.com writes:
 It was also mentioned that using the Perl encoding function was
 non-performant; which is why caching the data into a memoization table has
 value.

I find it hard to believe that the standards folk would have chosen a
hostname encoding method that was so inefficient that memoization using
a database table would outperform just recomputing the encoding.  What
I suspect the above really means is man, this Perl code sucks.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell

* on the Mon, Dec 29, 2014 at 07:00:05PM -0500, Andrew Sullivan wrote:

 CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames 
 (lower(punycode_encode(hostname)));

 This wouldn't work to get the original back if oyu have any IDNA200
 data, because puncode-encoding the UTF-8 under IDNA2003 and the
 punycode-decoding it doesn't always result in the same label.  See my
 other message.

The original is the thing that is stored in the database. I wouldn't need to
do any conversion to get the original back. In my example I am storing
the original and creating an index on the punycode version.

This is exactly the same method that we commonly use for performing case
insensitive text searches using lower() indexes.

-- 
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4


signature.asc
Description: Digital signature

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andrew Sullivan

[resending, because somehow this got routed through my work address the first 
time.]

Hi,

I didn't have time to write a short note, so I wrote a long one
instead.  Sorry.

On Mon, Dec 29, 2014 at 10:36:42PM +, Mike Cardwell wrote:
 can't just encode it with punycode and then store the ascii result. For 
 example,
 these two are the same hostnames thanks to unicode case folding [1]:
 
   tesst.ëxämplé.com
   teßt.ëxämplé.com

Well, in IDNA2003 they're the same.  In IDNA2008 (RFC 5890 and suite),
they're not the same.  In UTS46, they're kind of the same, because
pre-lookup processing maps one of them to the other (it depends which
mode you're in which way the mapping goes, which is just fantastic
because you can't tell at the server which mode the client is in.
IDNA is an unholy mess); but the lookup is still done using the
IDNA2008 rules, approximately.

 They both encode in punycode to the same thing:
 
   xn--tesst.xmpl.com-cib7f2a

Under no circumstances should they encode to that.  IDNA, either 2003
or 2008, is label by label, not for entire domain names.  (The labels
are the things between the dots in a presentation-format FQDN.)

So, under IDNA2003, you should get tesst.xn--xmpl-loa7ai.com for both
of them.  Under IDNA2008, you should get that also for the first of
them, and xn--tet-6ka.xn--xmpl-loa7ai.com for the second.  Most
clients, however, are probably going to want to do something UTS46ish,
but there's no way to guarantee that the IDNA2003 and IDNA2008 labels
go together (this is called variants, and you do not want to know
how much horror that has caused).

Worse, of course, you also don't know that what you have isn't just
raw UTF-8 in the label, but let's boil one ocean at a time.
 
 Don't believe me, then try visiting any domain with two s's in, whilst 
 replacing
 the s's with ß's. E.g:
 
   ericßon.com
   nißan.com
   americanexpreß.com

This depends entirely on which version of IDNA you're using.  Many
browsers right now officially do IDNA2003.  Unfortunately, they don't
_actually_ do that either because IDNA2003 is nominally nailed to
Unicode version 3.2, and there's approximately zero chance that a
running computer on the Internet these days is using such an old
Unicode version.

IDNA2008, by the way, wasn't something we did (I am one of the people
you can blame for this) for fun.  The very problem you're noting in
IDNA2003 is one of the things we were trying to fix.  Under IDNA2003,
the Unicode-Punycode-Unicode round trip could lose data.  Under
IDNA2008, this is fixed: every A-label (the xn--Punycodehere version)
corresponds to exactly one U-label (the Unicode representation) and
conversely.  (It follows from this that lots of Unicode strings aren't
U-labels, because there are a lot of rules about what can be a
U-label.  For instance, capital letters aren't allowed, because
they're not stable under caseFold.  This is all in the IDNA2008 RFCs,
but they're not an easy read.  We tried.)

 So if I pull out xn--tesst.xmpl.com-cib7f2a from the database, I've no idea
 which of those two hostnames was the original representation.

None.  What you need to do is split the name on label boundaries
(which is hard, because believe it or not . is a valid character in
the DNS, but splitting on the . character is probably as good as you
can do here.  But look for escaped ones).  Then you can check for
validity under IDNA2008 and IDNA2003, and then you can run it through
Punycode.

ICANN, for whatever it's worth, is using IDNA2008 rules for top-level
domains and as part of its IDNA guidelines, so over time the actually
_registered_ names are going to be either IDNA2008 or, at worst,
UTS46.

 The trouble is, if I store the unicode representation of a hostname instead,
 then when I run queries with conditions like:
 
   WHERE hostname='nißan.com'
 
 that wont pull out rows where hostname='nissan.com'.

Right.  If I were doing this, I think I'd probably create two
functions, one to do IDNA2003 and one to do IDNA2008.  Then I'd put a
functional index on it for both cases.  Eventually, you'll be able to
drop the IDNA2003 lookup because everything will conform to IDNA2003.
(Note that the WHATWG, which the W3C is going to listen to but which
is not a W3C WG, appears to be trying to undo that; but IDNA2003 is
irredeemably broken.  So there is a mess here brewing.)  

 So the system I've settled with is storing both the originally supplied
 representation, *and* the lower cased punycode encoded version in a separate
 column for indexing/search. This seems really hackish to me though.

Well, see above.  The other way I'd do it is to store _both_ the
IDNA2003 punycode and also the IDNA2008 A-label.  The reason I hate
this is because the lookup is insanely complicated.

 It seems to me that Postgres would benefit from a native hostname type and/or
 a pair of punycode encode/decode functions.

A pair won't work.

 And perhaps even a simple unicode case folding function.

Unicode case folding is _way_ more

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andrew Sullivan

On Tue, Dec 30, 2014 at 12:18:58AM +, Mike Cardwell wrote:
 
 This is exactly the same method that we commonly use for performing case
 insensitive text searches using lower() indexes.

Hmm.  How did you get the original, then?  If you have the original
Unicode version, why don't you switch to IDNA2008 publication rules,
which are way more reliable?  In that case, you do have a 1:1 lookup
and you shouldn't have a problem.  

If you need variants, then you have a different problem, but that
actually can be specified for the much narrower range of UTF-8
permissible under IDNA2008.

A

-- 
Andrew Sullivan
a...@crankycanuck.ca


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell

* on the Mon, Dec 29, 2014 at 05:01:44PM -0700, David G Johnston wrote:

 CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames
 (lower(punycode_encode(hostname)));

 That would prevent adding more than one representation for the same
 hostname
 to the column.
 
 Except two different hostname can resolve to the same
 punycode_encode(hostname) value

That's exactly what I'm taking advantage of...

 so the unique index won't work.

I think you misunderstand what I'm suggesing. I'm suggesting storing the
unicode version in the database, and then creating a unique index on the
result of the punycode function to prevent more than one representation
of the same hostname being inserted.

 It was also mentioned that using the Perl encoding function was
 non-performant; which is why caching the data into a memoization table has
 value.

It was non-performant because it involved unnecessarily starting a Perl
interpreter. Not because the algorithm its self is slow.
 
 WHERE lower(punycode_encode(hostname)) =
  lower(punycode_encode('any-representation'))
 
 
 I'm not for knowing the rules of punycode but I'm not seeing what value
 lower() provides here...

Case insensitive matching. So that EXAMPLE.COM = example.com

-- 
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4


signature.asc
Description: Digital signature

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell

* on the Mon, Dec 29, 2014 at 07:22:21PM -0500, Andrew Sullivan wrote:

 can't just encode it with punycode and then store the ascii result. For 
 example,
 these two are the same hostnames thanks to unicode case folding [1]:
 
   tesst.ëxämplé.com
   teßt.ëxämplé.com
 
 Well, in IDNA2003 they're the same.  In IDNA2008 (RFC 5890 and suite),
 they're not the same.  In UTS46, they're kind of the same, because
 pre-lookup processing maps one of them to the other (it depends which
 mode you're in which way the mapping goes, which is just fantastic
 because you can't tell at the server which mode the client is in.
 IDNA is an unholy mess); but the lookup is still done using the
 IDNA2008 rules, approximately.

Heh. And I just thought I was finally starting to get to grips with this stuff.
 
 They both encode in punycode to the same thing:
 
   xn--tesst.xmpl.com-cib7f2a
 
 Under no circumstances should they encode to that.

Eurgh, you're right. The library I'm using does actually do it right, I just
forgot to split on the dot and encode each label separately when writing the
examples for this email. Sorry for confusing matters.

[snip lots of useful and interesting information]

 You seem to want a bunch of label constraints, not all of which are
 related to IDNA. I think it would be better to break these up into a
 small number of functions.  As it happens, I have a colleague at Dyn
 who I think has some need of some of this too, and so it might be
 worth spinning up a small project to try to get generic functions:
 to_idna2003, to_idna2008, check_ldh, split_labels, and so on.  If this
 seems possibly interesting for collaboration, let me know  I'll try
 to put together the relevant people.

Those functions would be very useful to me. I know a bit of C, but probably not
enough to produce an acceptable patch. If there are people who would also find
these functions useful, and people motivated to implement them, that would
be great...

-- 
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4


signature.asc
Description: Digital signature

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell

* on the Mon, Dec 29, 2014 at 07:25:59PM -0500, Andrew Sullivan wrote:

 This is exactly the same method that we commonly use for performing case
 insensitive text searches using lower() indexes.

 Hmm.  How did you get the original, then?

The original in my case, is the hostname which the end user supplied.
Essentially, when I display it back to them, I want to make sure it is
displayed the same way that it was when they originally submitted it.

 If you have the original
 Unicode version, why don't you switch to IDNA2008 publication rules,
 which are way more reliable?  In that case, you do have a 1:1 lookup
 and you shouldn't have a problem.

I was unaware of the different versions of IDNA. I basically started using
the Perl module IDNA::Punycode in my project and assumed that this was the
only type. Seems like I need to do some more reading.

-- 
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4


signature.asc
Description: Digital signature

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andrew Sullivan

On Tue, Dec 30, 2014 at 12:53:42AM +, Mike Cardwell wrote:
  Hmm.  How did you get the original, then?
 
 The original in my case, is the hostname which the end user supplied.
 Essentially, when I display it back to them, I want to make sure it is
 displayed the same way that it was when they originally submitted it.

Ah.  This gets even better™ for you, then, because whereas in IDNA2003
you can pass it an old fashioned LDH name (letter, digit, hypen),
IDNA2008 treats those as _outside_ the spec.  So basically, you first
have to get a label and determine whether it is LDH or not (you can do
this by checking for any octet outside the LDH range) and then you can
decide which way to process it.  In IDNA2003, the punycode output from
an LDH label turns out always to be the LDH label.  The reason for
this is that you're supposed to validate that a U-label is really a
U-label before registering in IDNA2008, and lots of perfectly good LDH
labels (like EXAMPLE) are not valid under IDNA2008 because of upper
case.

(If by now you think that maybe it's time for this DNS thing to get
replaced, you have company.)

 I was unaware of the different versions of IDNA. I basically started using
 the Perl module IDNA::Punycode in my project and assumed that this was the
 only type. Seems like I need to do some more reading.

Yeah, this is all made much harder by the fact that several IDN
libraries still do 2003.  Here is one that many people are using for
IDNA2008:
https://gitorious.org/libidn2/libidn2/source/0d6b5c0a9f1e4a9742c5ce32b6241afb4910cae1:
It's GPLv3, though, which brings its own issues.

A

-- 
Andrew Sullivan
a...@crankycanuck.ca


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

[GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

15 matches

Site Navigation

Mail list logo

Footer information