On May 15, 2013, at 10:31 AM, Daniel Holth <[email protected]> wrote:

> How to avoid confusables.
> 
> These scripts are recommended for use in identifiers:
> http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
> 
> This report details a confusables detection algorithm:
> http://www.unicode.org/reports/tr39/#Confusable_Detection
> 
> And ICU implements it:
> http://www.icu-project.org/apiref/icu4c/uspoof_8h.html (see also
> PyICU).
> 
> The package index would enforce uniqueness of the "skeleton" of each
> registered package which is just an internal normalization based on
> confusability. if skeleton(identifier1) == skeleton(identifier2) then
> id1 and id2 are confusable.
> 
> The tooling could get away with a simpler rule like
> re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)
> 
> As a bonus to including the world, this should be able to prevent
> people from exchanging zeroes for capital O.
> 
> On Wed, May 15, 2013 at 7:17 AM, Eric V. Smith <[email protected]> wrote:
>> On 05/15/2013 07:10 AM, Donald Stufft wrote:
>>>>>> Anyone want to run a scan over the PyPI package set to see
>>>>>> how many packages would cause problems for a "[a-zA-Z0-9_.-]"
>>>>>> only filter?
>>>>> 
>>>>> See my previous email where I did queries against my local DB.
>>>>> It's 225 total projects that wouldn't be allowed.
>>>> 
>>>> Can you send the list of those projects?
>>>> 
>>>> Eric.
>>>> 
>>> 
>>> Here you go https://gist.github.com/dstufft/5583225 used a Python
>>> oneliner and the PyPI API so others can reproduce easily if they
>>> wish.
>> 
>> Perfect. Thanks.
>> 
>> It looks like space causes most of the issues. I'm not sure how
>> "Twisted Flow >= 1.0" would be expected to parse.
>> 
>> Eric.
>> 
>> 
>> _______________________________________________
>> Distutils-SIG maillist  -  [email protected]
>> http://mail.python.org/mailman/listinfo/distutils-sig



This gets into an area that is complicated to set up, more complicated to
maintain, and harder still to explain to people.

I also cannot find any data on whether the confusables list is a whitelist or
a blacklist, but given its nature as a list of characters that are confusing,
I'm going to guess it's a blacklist, which means it's very possible (and
likely) that there are glyphs missing from it that can easily be confused for
each other.
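
A toy sketch of that gap, assuming a skeleton-style check driven by a lookup
table. TOY_PROTOTYPES, toy_skeleton and confusable are names made up for this
example, and the table is a deliberately tiny stand-in, not the real UTS #39
data or ICU's implementation. It only shows that a pair missing from the table
goes undetected:

```python
import unicodedata

# Hypothetical, deliberately tiny stand-in for a confusables table.
# The real UTS #39 data is far larger; this is NOT that data.
TOY_PROTOTYPES = {
    "0": "O",
    "1": "l",
    "I": "l",
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def toy_skeleton(name):
    """Decompose, map each character through the table, decompose again."""
    decomposed = unicodedata.normalize("NFD", name)
    mapped = "".join(TOY_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a, b):
    """Two names collide when their toy skeletons are equal."""
    return toy_skeleton(a) == toy_skeleton(b)


# Caught: Cyrillic "o" and digit zero are both in the toy table.
caught_cyrillic = confusable("Django", "Djang\u043e")
caught_zero = confusable("PYTHON0", "PYTHONO")

# Missed: CYRILLIC SMALL LETTER DZE (U+0455) looks like "s" but is not
# in the toy table, so the lookalike name slips through.
missed = confusable("flask", "fla\u0455k")
```

Any table-driven check has this shape: whatever is not listed is treated as
distinct.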

It also doesn't solve the problem that these names can and will be used in
systems outside of a Python runtime that may or may not support Unicode
characters, so it affords a much smaller window of compatibility.
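
For comparison, the ASCII-only filter discussed earlier in the thread
("[a-zA-Z0-9_.-]") is small enough to sketch in full. The helper name
is_ascii_safe and the sample list are assumptions for illustration; the
sample names are taken from this thread:

```python
import re

# The character class proposed earlier in the thread, applied to the
# whole name. One or more allowed characters, nothing else.
ASCII_NAME = re.compile(r"[a-zA-Z0-9_.-]+")


def is_ascii_safe(name):
    """Return True when every character of the name is in the allowed set."""
    return ASCII_NAME.fullmatch(name) is not None


names = ["Django", "Twisted Flow >= 1.0", "fl\u00fcgelform", "pyramid-\u2714"]
rejected = [n for n in names if not is_ascii_safe(n)]
```

Here rejected holds every sample except "Django": the name containing spaces
and ">=" fails along with the two non-ASCII names.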

It also makes the URLs a whole heck of a lot less nice.

All for something that people haven't really even attempted to use (here's
the complete list of names ever registered on PyPI that use Unicode):

Manual de Py2Exe en Español
flügelform
☃
t☃
py<U+1F4A9>
<U+2063>_init__
D<U+2063>jango
D\x01jango
pyramid-✔

The vast bulk of them come from people either playing with Unicode or
attempting exactly what I outlined as a potential threat.
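
Two of the names above hide a character that does not render at all (U+2063
and \x01). A minimal sketch of how such names could be flagged, using only the
standard unicodedata module; the function name hidden_characters is an
assumption for this example, and the check covers only the Unicode control
(Cc) and format (Cf) categories:

```python
import unicodedata


def hidden_characters(name):
    """List (index, code point, category) for control/format characters."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.category(ch))
        for i, ch in enumerate(name)
        if unicodedata.category(ch) in ("Cc", "Cf")
    ]


invisible_separator = hidden_characters("D\u2063jango")
control_char = hidden_characters("D\x01jango")
clean = hidden_characters("Django")
```

Both lookalikes report a hit at index 1, while the plain "Django" reports
nothing.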

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

