Hello Everyone! I had previously sent a mail on this list, stating that I would like to work on pip's dependency resolution for my GSoC 2017. I now have drafted a proposal for the same; with help from my mentors - Donald Stufft and Justin Cappos. I'd also take this opportunity to thank them for agreeing to be my mentors for this GSoC.
I would like to request for comments on my proposal for GSoC - it is hosted at https://gist.github.com/pradyunsg/5cf4a35b81f08b6432f280aba6f511eb. Please find trailing a MarkDown version of the proposal. Thanks, Pradyun Gedam ----- # Adding Proper Dependency Resolution to pip - **Name:** Pradyun S. Gedam - **Email:** [pradyu...@gmail.com][mailto-email] - **Github:** [pradyunsg][github-profile] - **University:** [VIT University, Vellore, India][vit-homepage] - **Course:** Bachelor of Technology in Computer Science and Engineering - **Course Term:** 2016/17 - 2019/20 (4 Year) - **Timezone:** IST (GMT +5:30) - **GSoC Blog RSS Feed URL:** < https://pradyunsg.github.io/gsoc-2017/feed.xml> [github-profile]: http://github.com/pradyunsg/ [vit-homepage]: http://vit.ac.in/ [mailto-email]: mailto:pradyu...@gmail.com ## About Me I was introduced to Python about five years ago, through "Core Python Programming" by Weasley J Chun. After the initial struggle with Python 2/3, the ball was set rolling and hasn't stopped since. I have fiddled around with Game Programming (PyGame), Computer Vision (OpenCV), Data Analytics (Pandas, SciPy, NumPy), transcompilers (Py2C) and more. As a high school student, I did internship at Enthought in 2013 and 2014. The two summers that I spent at Enthought were a great learning experience and I thoroughly enjoyed the environment there. Other than Python, I have also used C, C++, Web Technologies (JavaScript, HTML, CSS) and preprocessors (Pug, TypeScript, LESS/SASS/SCSS), Java and Bash/Zsh for some other projects. Curious to understand how pip works, I began looking into pip's source code. I started contributing to pip in May 2016. I am now fairly familiar with the design of pip and have a fair understanding of how it works, due to the various contributions I have made to pip in the past year. ## Mentors - Donald Stufft (@dstufft on GitHub) - Justin Cappos (@JustinCappos on GitHub) Communication with the mentors will be done mostly on issues and pull requests on pip's GitHub repository. If at any point in time, a real time discussion is to be done with the mentors, it would be done over IRC or Skype. Email can also be used if needed. ## Proposal This project is regarding improving dependency resolution performed within pip by implementing a dependency resolver within it. ### Abstract Currently, pip does not resolve dependencies correctly when there are conflicting requirements. The lack of dependency resolution has caused hard-to-debug bugs/failures due to the installation of incompatible packages. The lack of a dependency resolver is also a blocker for various other features - adding an upgrade-all functionality to pip and properly determining build-time dependencies for packages are two such features. ### Deliverables At the end of this project, pip will have the ability to: - detect requirement conflicts - resolve conflicting dependency requirements where possible ### Implementation The implementation language will be Python. The code will maintain the compatibility requirements of pip - the same source code will support the multiple Python implementations and version, including but not limited to, CPython 2.7, CPython 3.3+, PyPy 2, PyPy3. New Tests for the functionality introduced will be added to pip's current test suite. User documentation would not be a major part of this project. The only changes would be to mention pip can now resolve dependencies properly. There are certain sections that might need updating. #### Overview The project will be composed of the following stages: 1. Refactor the dependency resolution logic of pip into a separate module. 1. Implement dependency information caching in pip. 1. Implement a backtracking dependency resolver to resolve the dependencies. Every stage depends on the previous ones being completed. This step-wise approach would make incremental improvements so that there is a constant feedback on the work being done as well as making it easy to change course without losing existing work; if needed for some unforeseen reason. #### Discussion There is a class in pip - `RequirementSet`, which is currently a God class. It is responsible for the following: 1. Holding a list of things to be installed 1. Downloading Files 1. Dependency Resolution 1. Unpacking downloaded files (preparation for installation) 1. Uninstalling packages 1. Installing packages This is clearly a bad situation since this is most of the heavy lifting involved in pip. These responsibilities need to be separated and moved out into their independent modules/classes, to allow for simplification of `RequirementSet` while providing a clean platform for the remaining work. This is the most tricky portion of this project, given the complexity of `RequirementSet` as it stands today. There are two kinds of distributions that may be used to install a package - wheels (binary) and sdists (source). When installing a package, pip builds a wheel from an sdist and then proceeds to install the wheel. The difference between the two formats of distribution relevant to this project is - wheels store the information about dependencies within them statically; sdists do not. Determining the dependencies of a wheel distribution is merely the matter of fetching the information from a METADATA file within the `.whl` file. The dependency information for an sdist, on the other hand, can only be determined after running its `setup.py` file on the target system. This means that dependencies of an sdist depend on how its `setup.py` behaves which can change due to variations on target systems or could even contain through random choices. Computation of an sdist's dependencies on the target system is a time-consuming task since it potentially involves fetching a package from PyPI and executing it's setup.py to get the dependency information. In order to improve performance, once an sdist's dependencies are computed, they would be stored in a cache so that during dependency resolution, the dependencies of an sdist are not computed every time they are needed. Further, pip caches wheels it downloads or builds meaning that any installed package or downloaded wheel's dependency information would available statically, without the need to go through the sdist dependency cache. Like the wheel cache, sdist-dependency-cache will be a file system based cache. The sdist-dependency-cache would only be used if the corresponding sdist is being used. Since sdist dependencies are non-deterministic, the cached dependency information is potentially incorrect - in certain corner cases such as using random choices in setup.py files. Such uses are not seen as important enough to cater to, compared the benefits of having a cache. Further, this is already the case with the information in the wheel cache. SAT solver based resolution is not feasible for pip since a SAT solver needs the entire set of packages and their dependencies to compute a solution, which cannot be generated due to the aforementioned non-deterministic behaviour of setup.py file. Computing dependency information for all of PyPI on a target system for installing a package is simply not feasible. The most reasonable approach is using a backtracking solver. Such a solver can be incrementally provided information about the dependencies of a package and would only need dependency information about packages in the dependency graph of the current system. There is a need to keep a cache of visited packages during dependency resolution. A certain package-version combination may be reached via multiple paths and it is an inefficient use of computation time to re-compute that whether it is indeed going to satisfy the requirements or not. By storing information about which package-version combinations have been visited and do (or do not) satisfy the constraints, it is possible to speedup the resolution. Consider the following example: ``` A-1.0 (depends: B) A-2.0 (depends: B and E-1.0) B-1.0 (depends: C and D) C-1.0 (depends: D) D-1.0 D-1.1 (depends: E-2.0) E-1.0 E-2.0 ``` If an installation of A is required, either A-2.0 or D-1.1 should not be installed because they have a conflicting requirement - E. While there are multiple possible solutions to this situation, the "most obvious" one us to use the D-1.0, instead of not installing A-2.0. Further, as multiple packages depend on D, the algorithm would "reach it" multiple times. By maintaining a cache for the visited packages, it is possible to achieve a speedup in such a scenario. #### Details Pull requests would be made on a regular basis during the project to ensure that the feedback loop is quick. This also reduces the possibilities of conflicts due to unrelated changes in pip. All the code will be tested within pip's existing testing infrastructure. It has everything needed to write tests related to all the changes to be made. Every PR made to pip as a part of this project will contain new tests or modifications to existing ones as needed. ##### Stage 1 Initially, some abstractions will be introduced to the pip codebase to improve the reuse of certain common patterns within the codebase. This includes cleaner temporary directory management through a `TempDirectory` class. `RequirementSet.prepare_files` and `RequirementSet._prepare_file` are downloading, unpacking packages as well as doing what pip does as dependency resolution. Taking these functions apart neatly is going to be a tricky task. The following is a listing of the final modules that will be responsible for the various tasks that are currently being performed by `RequirementSet`: - `pip.resolve` - Dependency Resolution - `pip.download` - Downloading Files & Unpacking downloaded files - `pip.req.req_set` - Holding a list of things to be installed / uninstalled - `pip.operations.install` - Installing Packages (preparation for installation) - `pip.operations.uninstall` - Uninstalling Packages To be able to proceed to the next step, only the dependency resolution related code needs to be refactored into a separate module. Other portions of `RequirementSet` are not required to be refactored but the same will be tackled as optional deliverables. In other words, only `pip.resolve` needs to be completed to be able to proceed to the next stage in this project. This is needed since in Stage 3, the resolver would be written in `pip.resolve`, independent of the rest of the codebase. ##### Stage 2 A new module `pip.cache` will be created. Within this module, all the caching will be handled. Thus, the code for the current wheel cache would be moved. The new code for a dependency cache for sdists would also be written here. The new cache would hold all of an sdist's egg-info. The information will be stored on the file-system in a sub directory structure much like that of the wheel cache, in a directory structure based on hash of the sdist file holding the egg-info at the end. This will be done in a class named `EggInfoCache`. `EggInfoCache` cache will be used only if a corresponding wheel for an sdist is not available. Installing an sdist results in the creation of a wheel which contains the dependency information, which would be used over the information available in the `EggInfoCache`. To be able to proceed to the next step, it is required that `EggInfoCache` is implemented. ##### Stage 3 The module `pip.resolve` will be modified and a class named `BacktrackingResolver` will be added to it. The class does what you expect it to do - it would resolve dependencies with recursive backtracking. As described above, there will be some global state stored about the packages that have been explored. Other than the maintenance of global state, in the form of the cache, the rest of the algorithm will essentially follow the same structure as any backtracking algorithm. The project would be complete when the aforementioned resolver is implemented. #### Existing Work There is existing work directly related to dependency resolution in pip, done by multiple individuals. - Robert Collins (un-merged closed pull request on pypa/pip) This has an incomplete backtracking dependency resolver and dependency caching. - Sebastien Awwad (branch on a separate fork) This was used for the depresolve project, to investigate the state of Dependency Resolution in PyPI/pip ecosystem. - `pip-compile` (separate project) This has a backtracking dependency resolver implemented to overcome pip's inablity to resolve dependencies. Their work would be used for reference, where appropriate, during the course of the project. Further, there are many package managers which implement dependency resolution, which would also be looked into. ### Tentative Timeline - Community Bonding Period: **5 May - 29 May** - Clean up and finalize my existing pull requests. - Read existing literature regarding dependency resolution. - Inspect existing implementations of dependency resolvers. GOAL: Be ready for the work coming up. - Week 1: **30 May - 5 June** - Introduce abstractions across pip's codebase to make refactoring `RequirementSet` easier. - Begin breaking down `RequirementSet.prepare_file`. - Week 2: **6 June - 12 June** - Continue working on the refactor of `RequirementSet`. - Week 3: **13 June - 19 June** - Continue working on the refactor of `RequirementSet`. - Finish and polish `pip.resolve`. GOAL: `pip.resolve` module will be ready, using the current resolution strategy. - Week 4: **20 June - 26 June** - Finish and polish all work on the refactor of `RequirementSet`. - Week 5: **27 June - 3 July** - Move code for `WheelCache` into a new `pip.cache` module. - Write tests for `pip.cache.EggInfoCache`, based on `WheelCache`. - Begin implementation of `pip.cache.EggInfoCache`. - Week 6: **4 July - 10 July** - Finish and polish `pip.cache.EggInfoCache`. GOAL: A cache for storing dependency information of sdists would be ready to add to pip. - Week 7: **10 July - 16 July** - Create a comprehensive set of tests for the dependency resolver. (in `tests/unit/test_resolve.py`) - Begin implementation on the backtracking algorithm. GOAL: A comprehensive test bed is ready for testing the dependency resolver. - Week 8: **17 July - 24 July** - Complete a rough implementation of the backtracking algorithm GOAL: An implementation of a dependency resolver to begin running tests on and work on improving. - Week 9: **25 July - 31 July** - Fixing bugs in dependency resolver - Week 10: **1 August - 6 August** - Finish and polish work on dependency resolver GOAL: A ready-to-merge PR adding a backtracking dependency resolver for pip. - Week 11: **6 August - 13 August** Buffer Week. - Week 12: **13 August - 21 August** Buffer Week. Finalization of project for evaluation. If the deliverable is achieved ahead of schedule, the remaining time will be utilized to resolve open issues on pip's repository in the order of priority as determined under the guidance of the mentors. #### Other Commitments I expect to not be able to put in 40 hour/week for at most 3 weeks throughout the working period, due to the schedule of my university. I will have semester-end examinations - from 10th May 2017 to 24th May 2017 - during the Community Bonding Period. My university will re-open for my third semester on 12 July 2017. I expect mid-semester examinations to be held in my University around 20th August 2017. During these times, I would not be able to put in full 40 hour weeks due to the academic workload. I might take a 3-4 day break during this period, regarding which I would be informing my mentor around a week in advance. I will be completely free from 30th May 2017 to 11 July 2017. ### Future Work There seems to be some interest in being able to reuse the above dependency resolution algorithm in other packaging related tools, specifically from the buildout project. I intend to eventually move the dependency resolution code that would come out of this project into a separate library to allow for reuse by installer projects - pip, buildout and other tools. ## Previous Contributions to pip (As on 12th March 2017) ### Issues Authored: - #3785 - Prefering wheel-based installation over source-based installation (Open) - #3786 - Make install command upgrade packages by default (Closed) - #3787 - Check if pip broke the dependency graph and warn the user (Open) - #3807 - Tests fail since change on PyPI (Closed) - #3809 - Switch to TOML for configuration files (Open) - #3871 - Provide a way to perform non-eager upgrades (Closed) - #4198 - Travis CI - pypy broken dues to dependency change in pycrypto (Closed) - #4282 - What's the release schedule? (Closed) Participated: - #59 - Add "upgrade" and "upgrade-all" commands (Open) - #988 - Pip needs a dependency resolver (Open) - #1056 - pip install -U does not remember whether a package was installed with --user (Open) - #1511 - ssl certificate hostname mismatch errors presented badly (Open) - #1668 - Default to --user (Open) - #1736 - Create a command to make it easy to access the configuration file (Open) - #1737 - Don't tell the user what they meant, just do what they meant (Open) - #2313 - Automated the Creation and Upload of Release Artifacts (Open) - #2732 - pip install hangs with interactive setup.py setups (Open) - #3549 - pip -U pip fails (Open) - #3580 - Update requests/urllib3 (Closed) - #3610 - pip install from package from github, with github dependencies (Open) - #3788 - `pip` version suggested is older than the version which is installed (Closed) - #3789 - Error installing Mayavi in Mac OS X (Closed) - #3798 - On using python -m pip install -upgrade pip.. its throwing an error like the one below (Closed) - #3811 - no matching distribution found for install (Closed) - #3814 - pip could not find a version that satisfies the requirement oslo.context (Closed) - #3876 - support git refs in @ syntax (Open) - #4021 - Will you accept PRs with pep484 type hints? (Open) - #4087 - pip list produces error (Closed) - #4149 - Exception thrown when binary is already linked to /usr/local/bin (Open) - #4160 - Pip does not seem to be handling deep requirements correctly (Open) - #4162 - Let --find-links be context aware to support github, gitlab, etc. links (Open) - #4170 - pip list |head raises BrokenPipeError (Open) - #4182 - pip install should install packages in order to avoid ABI incompatibilities in compiled (Open) - #4186 - IOError: [Errno 13] Permission denied: '/usr/local/bin/pip' (Open) - #4206 - Where on Windows 10 is pip.conf or pip.ini located? (Closed) - #4221 - Feature request: Check if user has permissions before downloading files (Closed) - #4229 - "pip uninstall" is too noisy (Open) #### Pull Requests Authored: - #3806 - Change install command's default behaviour to upgrade packages by default (Closed, superseded by #3972) - #3808 - Fix Tests (Merged) - #3818 - Improve UX and tests of check command (Merged) - #3972 - Add an upgrade-strategy option (Merged) - #3974 - [minor] An aesthetic change to wheel command source (Merged) - #4192 - Move out all the config code to a separate module (Merged) - #4193 - Add the ability to autocorrect a user's command (Open) - #4199 - Fix Tests for Travis CI (Merged) - #4200 - Reduce the API exposed by the configuration class (Merged) - #4232 - Update documentation to mention upgrade-strategy (Merged) - #4233 - Nicer permissions error message (Open) - #4240 - [WIP] Add a configuration command (Open) Participated: - #2716 - Issue #988: new resolver for pip (Closed) - #2975 - Different mock dependencies based on Python version (Merged) - #3744 - Add a "Upgrade all local installed packages" (Open) - #3750 - Add a `pip check` command. (Merged) - #3751 - tox.ini: Add "cover" target (Open) - #3794 - Use the canonicalize_name function for finding .dist-info (Merged) - #4142 - Optionally load C dependencies based on platform (Open) - #4144 - Install build dependencies as specified in PEP 518 (Open) - #4150 - Clarify default for --upgrade-strategy is eager (Merged) - #4194 - Allow passing a --no-progress-bar to the install script to surpress progress bar (Merged) - #4201 - Register req_to_install for cleanup sooner (Merged) - #4202 - Switch to 3.6.0 final as our latest 3.x release (Merged) - #4211 - improve message when installing requirements file (#4127) (Merged) - #4241 - Python 3.6 is supported (Merged) ## References 1. [pypa/pip#988](https://github.com/pypa/pip/issues/988) Tracking issue for adding a proper dependency resolver in pip. Contains links to various useful resources. 1. [pypa/pip#2716](https://github.com/pypa/pip/issues/2716) Prior work by Robert Collins for adding a proper dependency resolver in pip. 1. [Python Dependency Resolution]( https://docs.google.com/document/d/1x_VrNtXCup75qA3glDd2fQOB2TakldwjKZ6pXaAjAfg/edit?usp=sharing ) A writeup by Sebastian Awwad on the current state of dependency resolution in pip and PyPI in general. 1. [PSF Application Template]( https://wiki.python.org/moin/SummerOfCode/ApplicationTemplate2017) For guidance on how to write the application and what information is needed. 1. [Stork: Secure Package Management for VM Environments]( http://isis.poly.edu/~jcappos/papers/cappos_stork_dissertation_08.pdf) A Paper by Justin Cappos about Stork, used for reference regarding Backtracking Resolution
_______________________________________________ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig