On July 23, 2014 at 1:09:00 PM, Richard Jones (r1chardj0...@gmail.com) wrote:
I have been mulling over PEP 470 for some time without having the time to truly 
dedicate to addressing it. I believe I'm up to date with its contents and the 
(quite significant, and detailed) discussion around it.

To summarise my understanding, PEP 470 proposes to remove the current link 
spidering (pypi-scrape, pypi-scrape-crawl) while retaining explicit hosting 
(pypi-explicit). I believe it retains the explicit links to external hosting 
provided by pypi-explicit.
No, it removes pypi-explicit as well, leaving only files hosted on PyPI. On top 
of that it adds new functionality whereby project authors can indicate that 
their files are hosted on a non-PyPI index. This allows tooling to tell users 
that they need to add additional indexes to their install commands in order to 
install something, while still allowing PyPI to act as a central authority for 
naming without forcing people to upload to PyPI.
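
For illustration only (the "external-index" metadata key, the resolve() helper, 
the message text, and the URLs below are invented for this sketch, not anything 
the PEP specifies), the tooling behaviour described here might look roughly 
like this:

def resolve(project, metadata, configured_indexes):
    # Hypothetical: PyPI records that this project's files live on another
    # index, and the installer turns that into an actionable message instead
    # of an obscure error.
    external = metadata.get("external-index")
    if external and external not in configured_indexes:
        raise SystemExit(
            "{0} is not hosted on PyPI. Its author indicates it is available from\n"
            "{1}; re-run with --extra-index-url {1} to install it.".format(project, external)
        )
    # otherwise: resolve normally against the configured indexes

# resolve("example-project",
#         {"external-index": "https://downloads.example.com/simple/"},
#         ["https://pypi.org/simple/"])   # -> exits with the hint above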



The reason given for this change is the current bad user experience around the 
--allow-external and --allow-unverified options to pip install. That is, that 
users currently attempt to install a non-pypi-explicit package and the result 
is an obscure error message.
That’s part of the bad UX; the other part is that users are not particularly 
aware of the difference between an external and an unverified link (in fact many 
people involved in packaging were not aware until I explained it to them; the 
difference is subtle). Part of the problem is that while it’s easy for *tooling* 
to determine the difference between external and unverified, a human has to 
inspect the actual HTML of the /simple/ page to do so.



I believe the current PEP addresses the significant usability issues around 
this by swapping them for other usability issues. In fact, I believe it will 
make matters worse with potential confusion about which index hosts what, 
potential masking of release files or even, in the worst scenario, potential 
spoofing of release files by indexes out of the control of project owners.
So yes, that’s a potential problem with any multi-index scheme. However, I do 
not believe these are serious problems. It is the model used by every Linux 
vendor, and anyone who has ever used Linux (or most of the various BSDs) is 
already familiar with it. On top of that, it is something end users would need 
to be aware of anyway if they want to use a private index, install commercial 
software from a restricted index, or handle any number of other situations. In 
other words, multiple indexes don’t go away; they will always be there. The 
effect of PEP 438 is that users need to be aware of *two* different ways of 
installing things not hosted on PyPI instead of just one.

Having two concepts instead of one is another part of the bad UX inflicted by 
PEP 438. The Zen of Python says there should be one obvious way to do 
something, and I think that is a good thing to strive for.



I would like us to consider instead working on the usability of the existing 
workflow: rather than throwing an error, start a dialog with the user:

$ pip install PIL
Downloading/unpacking PIL
  PIL is hosted externally to PyPI. Do you still wish to download it? [Y/n] y
  PIL has no checksum. Are you sure you wish to download it? [Y/n] y
Downloading/unpacking PIL
  Downloading PIL-1.1.7.tar.gz (506kB): 506kB downloaded
...

Obviously this would require scraping the site, but given that this interaction 
would only happen for a very small fraction of projects (those for which no 
download is located), the overall performance hit is negligible. The PEP 
currently states that this would be a "massive performance hit" for reasons I 
don't understand.
It’s a big performance hit because we can’t just assume that if there is a 
download located on PyPI there is not a better download hosted externally. So 
in order to do this accurately we must scan every URL we locate to build up a 
complete list of all the potential files, and only then ask whether the person 
wants to download one.

For a rough indication of the difference: I can scan all of PyPI looking for 
potential release files in about 20 minutes if I restrict myself to only things 
hosted directly on PyPI. If I include the additional scanning, that time jumps 
to 3-4 hours, roughly 9-12x slower. And that’s with an incredibly aggressive 
timeout and a blacklist so that bad hosts are only tried once.
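
A rough sketch of the two crawl strategies being compared (the base URL, 
timeout, and parsing below are illustrative, stdlib only; a real crawler also 
has to parse every externally hosted page it fetches, follow redirects, and so 
on). The point is that the "full" pass multiplies the number of HTTP requests, 
and many of the extra hosts are slow or dead:

import urllib.request
from html.parser import HTMLParser

SIMPLE = "https://pypi.org/simple"   # illustrative; was pypi.python.org at the time
TIMEOUT = 5                          # the "incredibly aggressive timeout"

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.file_links = []      # direct links on the /simple/ page
        self.scrape_pages = []    # rel="homepage" / rel="download" pages

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("rel") in ("homepage", "download"):
            self.scrape_pages.append(attrs.get("href", ""))
        else:
            self.file_links.append(attrs.get("href", ""))

def candidate_files(project, follow_external):
    url = "{0}/{1}/".format(SIMPLE, project)
    with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
        parser = LinkCollector()
        parser.feed(resp.read().decode("utf-8", "replace"))
    files = list(parser.file_links)          # the ~20 minute pass stops here
    if follow_external:                      # the 3-4 hour pass continues
        for page in parser.scrape_pages:
            try:
                with urllib.request.urlopen(page, timeout=TIMEOUT) as ext:
                    ext.read()               # a real crawler parses this page too
            except Exception:
                pass                         # dead hosts: time out, 404, etc.
    return files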



The two prompts could be made automatic "y" responses for tools using the 
existing --allow-external and --allow-unverified flags.
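
As a minimal sketch of that mapping (the confirm() helper is hypothetical, not 
pip's actual code): interactive users get asked, while --allow-external / 
--allow-unverified, represented here as plain booleans, turn the questions into 
automatic "y" answers.

def confirm(question, auto_yes=False):
    if auto_yes:
        return True
    answer = input("  {0} [Y/n] ".format(question)).strip().lower()
    return answer in ("", "y", "yes")

allow_external = True     # e.g. set by --allow-external PIL
allow_unverified = False  # e.g. set by --allow-unverified PIL

proceed = (confirm("PIL is hosted externally to PyPI. Do you still wish to download it?",
                   auto_yes=allow_external)
           and confirm("PIL has no checksum. Are you sure you wish to download it?",
                       auto_yes=allow_unverified))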

I also note that PEP 470 says "PEP 438 proposed a system of classifying file 
links as either internal, external, or unsafe", whereas PEP 438 has no mention 
of "unsafe". This leads "unsafe" to never actually be defined anywhere that I 
can see.
I can define them in the PEP, but basically:

* internal - Things hosted by PyPI itself.

* external - Things hosted off of PyPI, but linked directly from the /simple/ 
page with an acceptable hash.

* unsafe - Things hosted off of PyPI, either linked directly from the /simple/ 
page *without* an acceptable hash, or things hosted on a page which is linked 
from a rel="homepage" or rel="download" link.
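
Concretely, a classification along those lines might look like this sketch (the 
host names are illustrative, today's PyPI domains rather than 2014's 
pypi.python.org, and classify() is an invented helper, not pip's real 
implementation):

from urllib.parse import urlparse

PYPI_HOSTS = {"pypi.org", "files.pythonhosted.org"}

def classify(url, has_acceptable_hash, linked_from_simple_page):
    host = urlparse(url).hostname or ""
    if host in PYPI_HOSTS:
        return "internal"       # hosted by PyPI itself
    if linked_from_simple_page and has_acceptable_hash:
        return "external"       # off PyPI, direct link with an acceptable hash
    return "unsafe"             # no hash, or found via rel="homepage"/"download"

classify("https://files.pythonhosted.org/packages/PIL-1.1.7.tar.gz", True, True)  # internal
classify("https://example.com/PIL-1.1.7.tar.gz#md5=abc123", True, True)           # external
classify("https://example.com/downloads/", False, False)                          # unsafe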



Finally, the Rejected Proposals section of the PEP appears to have a couple of 
justifications for rejection which have nothing whatsoever to do with the 
Rationale ("PyPI is fronted by a globally distributed CDN...", "PyPI supports 
mirroring..."). As Holger has already indicated, that second one is going to 
have a heck of a time dealing with the PEP 470 changes, at least in the devpi 
case.
PEP 381 mirroring will require zero changes to cope with this proposal, since 
it explicitly requires that the mirror client download the HTML of the /simple/ 
page and serve it unmodified. If devpi requires changes, that is because it 
does not follow the documented protocol.
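
For reference, the core of that documented behaviour is tiny. A minimal sketch 
(the URL and directory layout are illustrative, and the full PEP 381 protocol 
also covers changelog serials, package files, and signatures):

import os
import urllib.request

def mirror_simple_page(project, base="https://pypi.org/simple", dest="web/simple"):
    url = "{0}/{1}/".format(base, project)
    with urllib.request.urlopen(url) as resp:
        html = resp.read()                    # raw bytes, no rewriting
    target = os.path.join(dest, project)
    os.makedirs(target, exist_ok=True)
    with open(os.path.join(target, "index.html"), "wb") as fp:
        fp.write(html)                        # later served exactly as fetched

mirror_simple_page("pil")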

Those additional justifications are why we need a much clearer line between 
what is available on the PyPI repository, and what is available elsewhere. They 
are why we can’t just eliminate the ``--allow-external`` case (which is safe, 
but has availability and speed concerns).



 "PyPI has monitoring and an on-call rotation of sysadmins..." would be solved 
through improving the failure message reported to the user as discussed above.
We can’t have better failure messages because we don’t have any idea if a 
particular URL is expected to be up or if it has bit rotted to death and thus 
is an expected failure. Because of this pip has to more or less silently ignore 
failing URLs and ends up presenting very confusing error messages.

Forgive me if these don’t make sense, I’m real sick today.

-- 
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA