On 04/06/2017 07:59 PM, Heikki Linnakangas wrote:
> Another thing I'd like some more eyes on, is how this will work with
> encodings other than UTF-8. We will now try to normalize the password as
> if it was in UTF-8, even if it isn't. That's OK as long as we're
> consistent about it, but there is one worrisome scenario: what if the
> user's password consists mostly of characters, that when interpreted as
> UTF-8, are in the list of ignored characters. IOW, is it realistic that
> a user might have a password in a non-UTF-8 encoding, that gets silently
> mangled into something much shorter? I think that's highly unlikely, but
> can anyone come up with a plausible example of that?

I did some testing on what the byte sequences for the Unicode characters that SASLprep ignores mean in other encodings. I created a text file containing every ignored character, in UTF-8, and ran "iconv -f <other encoding> -t UTF-8//TRANSLIT" on the file, using all supported server encodings. The idea is to take each of the ignored byte sequences, and pretend that they are in some other encoding. If converting them to UTF-8 results in a legit character, then that character means something in that encoding, and could be misinterpreted if it's used in a password.
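To make the idea concrete, here is a minimal Python sketch of the same kind of check for a single character. It is not the procedure I actually ran: it uses Python's built-in codecs rather than iconv, so the conversion tables may differ in details, and the helper name misread() is made up for this illustration.

```python
def misread(ch, encoding):
    """Return what the UTF-8 bytes of `ch` look like when read as `encoding`.

    Hypothetical helper: it uses Python's codecs, not iconv, and shows any
    byte that does not form a character in `encoding` as U+FFFD.
    """
    return ch.encode("utf-8").decode(encoding, errors="replace")


# A few of the characters that SASLprep ignores.
samples = ["\u00ad", "\u1806", "\u200b", "\ufeff"]

for encoding in ["euc_jp", "gb2312", "euc_kr", "cp866"]:
    for ch in samples:
        raw = ch.encode("utf-8")
        print(encoding, "U+%04X" % ord(ch), raw.hex(" "), misread(ch, encoding))
```

A three-byte UTF-8 sequence read as a two-byte EUC encoding leaves one byte over, which is why this sketch decodes with errors="replace" instead of failing.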

Here are some characters that could plausibly be misinterpreted and ignored by SASLprep:

-------
EUC-JP and EUC-JISX0213:

U+00AD (C2 AD): 足 (meaning "foot", per Unihan database)
U+FE00-FE0F (EF B8 8X): 鏝 (meaning "trowel", per Unihan database)

EUC-CN:

U+00AD (C2 AD): 颅 (meaning "skull", per Unihan database)
U+FE00-FE0F (EF B8 8X): 锔 (meaning "curium", per Unihan database)
U+FEFF (EF BB BF): 锘 (meaning "nobelium", per Wikipedia)

EUC-KR:

U+FE00-FE0F (EF B8 8X): 截 (meanings "cut off, stop, obstruct, intersect", per Unihan database)
U+FEFF (EF BB BF): 癤 (meanings "pimple, sore, boil", per Unihan database)

EUC-TW:
U+FE00-FE0F: 踫 (meanings "collide, bump into", per Unihan database)
U+FEFF: 踢 (meaning "kick", per Unihan database)

CP866:
U+1806: саЖ
U+180B: саЛ
U+180C: саМ
U+180D: саН
U+200B: тАЛ
U+200C: тАМ
U+200D: тАН
-------

The CP866 cases seem most likely to cause confusion. Those are all common words in Russian. I don't know how common those Chinese/Japanese characters are.

Overall, I think this is OK. Even though some characters can be misinterpreted, for it to be a problem all of the following have to be true:

1. The client is using one of those encodings.
2. The password string as a whole has to look like valid UTF-8.
3. Ignoring those characters/words from the password would lead to a significantly weaker password, i.e. it was not very long to begin with, or it consisted almost entirely of those characters/words.
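The three conditions can be sketched as a small check. This is only an illustration under assumptions: the function name surviving_chars() is made up, the ignored set is the "commonly mapped to nothing" table (B.1) from RFC 3454, and the code models only that stripping step, not the rest of SASLprep (NFKC normalization, prohibited characters).

```python
# Code points that SASLprep maps to nothing (RFC 3454, table B.1).
IGNORED = {0x00AD, 0x034F, 0x1806, 0x180B, 0x180C, 0x180D,
           0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF} | set(range(0xFE00, 0xFE10))


def surviving_chars(password, client_encoding):
    """Return what is left of the password after the ignore step, or None.

    Hypothetical helper. None means the bytes are not valid UTF-8, so
    condition 2 does not hold for this password.
    """
    raw = password.encode(client_encoding)       # condition 1: client encoding
    try:
        as_utf8 = raw.decode("utf-8")            # condition 2: looks like UTF-8
    except UnicodeDecodeError:
        return None
    # Condition 3 depends on how much survives this filter.
    return "".join(c for c in as_utf8 if ord(c) not in IGNORED)


print(repr(surviving_chars("саЖтАЛ", "cp866")))   # made up entirely of ignored sequences
print(repr(surviving_chars("саЖ123", "cp866")))  # only the first word is dropped
print(repr(surviving_chars("пароль", "cp866")))  # not valid UTF-8 at all
```

Only the first kind of password, one built almost entirely from the misinterpreted sequences, ends up significantly weaker.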

Thoughts? Attached is the full results of running iconv with each encoding, from which I picked the above cases.

- Heikki

Attachment: saslprep-confusing-chars.txt.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)