You guys are fairly far into your debate, so hopefully I'm not interjecting
something that's already been gone over :-)

Chris Withers wrote:
> Matthias Klose wrote:
>>>> Install debian and get back to productive tasks.
>>> This is an almost troll-like answer.
>>> See page 35 of the presentation.
>>
>> I disagree. You could think of "Packages are Pythons Plugins" (taken
>> from page 35) as a troll-like statement as well.
> 
> You're welcome to your (incorrect) opinion ;-)
> Debian packages could just as easily be seen as Debian's plugins.
> 
For a *very* loose definition of plugin, perhaps.  But if you look at:
  http://en.wikipedia.org/wiki/Plugin

the idea of Debian packages being plugins is a pretty far stretch.  The
idea of packages being Python plugins is less of a stretch, but I'd call
it an analogy.  It's useful for looking at things in a new light, but if
we start designing a plugin interface and only viewing packages through
that definition, I think we'll be hindering ourselves.

>>> - all the package management systems behave differently and expect
>>> packages to be set up differently for them
>>
>> correct, but again they share common requirements.
> 
> ...but all have different implementations.
> 
The common requirements are more important than the varying
implementations when thinking about the metadata and how flexible it
needs to be.  Realizing that there are different implementations is
useful when justifying the need for a separate Python build tool and
distribution format.  On the one hand, we need to expose package naming,
versioning, and dependencies to outside tools because they all have a
common need for that information; on the other, we have to recognize
that there's a need for both run-from-egg and run-from-FHS-locations.
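
To make that concrete: the PKG-INFO file that distutils already writes is
plain RFC 822-style text, so an outside tool (Python or not) can read the
name and version without ever executing setup.py.  A rough sketch, with an
illustrative file path:

    # Read the name/version a distribution declares without running any
    # of its code; PKG-INFO is plain "Key: value" (RFC 822-style) text.
    from email.parser import Parser

    with open("PKG-INFO") as f:
        metadata = Parser().parse(f)

    print("%s %s" % (metadata["Name"], metadata["Version"]))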

>> some people prefer to name this "stable releases" instead of
>> "bitrot". 
> 
> I'll call bullshit on this one. The most common problem I have as a
> happy Debian user and advocate when I go to try and get help for a
> packaged application (I use packages because I perhaps mistakenly assume
> this is the best way to get security-fixed software), such as postfix,
> postgres, and Zope if I was foolish enough to take that path, is "why
> are you using that ancient and buggy version of the software?!" shortly
> before pointing out how all the issues I'm facing are solved in newer
> (stable) releases.
> 
> The problem is that first the application needs to be tested and
> released by its community, then Debian needs to re-package, patch,
> generally mess around with it, etc before it eventually gets a "Debian
> release". It's bad enough with apps with huge support bases like
> postgres, imagine trying to do this "properly" for the 4000-odd packages
> on PyPI...
> 
You're correct about the results you're seeing, but not about the reason
they exist.  There are many Linux distributions and each has a different
policy on how to update packages.  The reason for the variety is that
there's demand for both fast package updates and slow package updates.
Debian Stable, Red Hat Enterprise Linux, and other stable,
enterprise-oriented distributions aim to provide a stable base on which
people can build their applications and processes.  A common
misconception among developers who want faster cycles is that the base
system is just a core of packages, while things closer to the leaves of
the dependency tree could be updated freely (i.e. don't update the
kernel; do update the python-sqlalchemy package).  What's not seen is
that these distributions are providing the base for so many people that
any update which changes the API/ABI/on-disk format/etc. is likely to
break *someone* out there.  You want to be using one of these systems if
you have deployed a major application that serves thousands of people
and can afford little to no downtime, because you can be more assured
that any changes to the system are either changes that are
overwhelmingly necessary (with API/ABI breakage reduced as much as
possible) or changes that you yourself have introduced.

For system administrators it can also be frustrating, knowing that newer
upstream packages contain bug fixes that aren't supposed to change
backwards compatibility.  The problem here is that we all know all
software has bugs.  The risk in updating to a newer stable version is
that the new software has bugs as bad as or worse than the old one's.
The package maintainers have to weigh how many changes have gone into
the new version and how big the current problem is, and then apply the
distribution's update policy.  For a stable, enterprise-oriented distro,
it's often a case of "better the devil you know than the devil you
don't".

For a developer, or someone deploying a new system (as opposed to
someone who's had one deployed for several years before hitting a
certain bug), this can be quite frustrating because you know there are
fixes and features in newer versions of the software.  When you have the
choice, use one of the other Linux distributions: either one whose focus
is on staying close to what upstream is shipping (I'd recommend this for
developers), or one with a stable policy but a more recent release date
and therefore newer packages.  When you don't have a choice, you have to
be prepared for the possibility that you'll need to install your app's
requirements from another source (another version the distribution
supports, such as Debian backports, or installing from source, or
installing an egg).  Remember, though, that sometimes the distribution
will update a package for you if you just request it.  It depends on the
severity of what's currently broken, the risks involved in updating, and
the distribution's (and maintainer's) policies and perception of the
risk vs. reward.

>> Speaking of extensions "maintained by the entity originating the
>> python package": this much too often is a way of bitrot. is the
>> shipped library up to date? does it have security fixes? how many
>> duplicates are shipped in different extensions? does the library need
>> to be shipped at all (because some os does ship it)?
> 
> So what do you propose doing when projectA depends on version 1.0 of libC
> and projectB depends on version 2.0 of libC?
> 
This is a problem that is not new for distributions.  Each one handles
it slightly differently.  For Fedora, we've decided the best course is
to help upstream port to newer versions of the library.  However, since
this isn't always practical, we sometimes introduce compatibility
packages which have the old version of libraries so older programs will
continue to work.

Having multiple versions is not ideal, as this is where bitrot sets in
in earnest.  If upstream for libC only supports version 2.0 and a
security flaw comes out that affects both libC-1.0 and libC-2.0, then we
have to fix libC-1.0 at the distribution level.  That's more work for us
to support something outdated.  We'd much rather do work that has a
future upstream by porting the application to the newer version, and the
time to do that is *not* when there's a security flaw that has to be
fixed yesterday.

The exact wrong thing to do (and prohibited by policy in most
distributions) is for applications to carry their own copies of the
libraries.  When a security flaw comes out in that case, we'd have to:

1) hunt through all the packages we ship to find any that are affected.
2) update the various bundled versions in all of those packages, which
might mean generating multiple different fixes.
3) rebuild those packages and force our users to redownload all of them.

If we had separate library packages for the separate versions we'd:
1) know exactly which packages had to be fixed (see the sketch below).
2) only have to apply fixes once to the versions we were shipping.
3) have our users download only the library packages, since the
applications will load the fixed version from the system.
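
Here's the sort of query that real dependency metadata makes possible.
This is only a sketch using pkg_resources and the hypothetical "libC"
name from above, not actual distro tooling:

    # List installed distributions that declare a dependency on a given
    # project, so you know exactly what needs rebuilding after a fix.
    import pkg_resources

    def consumers_of(project_name):
        wanted = project_name.lower()
        for dist in pkg_resources.working_set:
            if any(req.project_name.lower() == wanted
                   for req in dist.requires()):
                yield dist

    for dist in consumers_of("libC"):
        print("%s %s" % (dist.project_name, dist.version))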

I can go on with other reasons why this is a bad idea and how to
mitigate problems but if you're convinced already, I'll surrender the
soapbox to someone else :-)

>> Considering an extension interfacing a library
>> shipped with the os, you do want to use this library, not add another
>> copy. 
> 
> libxml2 seems to be a good example to use here...
> 
> I guess on debian I'd need to likely install libxml2-dev before I could
> install the lxml package...
> 
Note: I'm a Fedora dev, not a Debian dev, but the packaging techniques
are similar in their generalities.  You should just be able to request
that lxml be installed and it will automatically pull in libxml2.
libxml2-dev shouldn't enter the picture, since a Python program that
imports lxml won't need the C headers.

(Unless you're talking about *building* lxml which is a separate problem.)

> ...what about MacOS X?
> 
> ...what about Windows?
> 
Aren't you going to be distributing a separate version for MacOS X and
Windows anyway, since the norm is not to compile from source on those
platforms?  Then you're already at the point where you have multiple
packages for different OSes: a source tarball for Unix distributors and
a binary zip/binhex/what-have-you for MacOS X and Windows.

>> An upstream
>> extension maintainer cannot provide this unless he builds this
>> extension for every (os) distribution and maintains it during the os'
>> lifecycle.
> 
> ...or just says in the docs "hey, you need libxml2 for this, unless
> you're on Windows, in which case the binary includes it".
> 
>>  - os distributors usually try to minimize the versions they include,
>>    trying to just ship one version.  
> 
> ...which is fair enough for the "system python", but many of us have a
> collection of apps, some of which require Python 2.4, some Python 2.5,
> and on top of each of those, different versions of different packages
> for each app.
> 
> In my case, I do source (alt-)installs of python rather than trusting
> the broken stuff that ships with Debian and buildout to make sure I get
> the right versions of the right packages for each project.
> 
So this is fine to a certain extent.

Pros:
* Allows you to develop new applications using known good or latest
versions of other software.
* Allows you to deploy an app using newer-than-system libraries on an
otherwise stable-class distribution.

Cons:
* You become responsible for the code of all the components you're
installing.  If there's a bug in your alt-install of lxml, you're the
one who has to fix it rather than the Linux distribution.
* If you're distributing this so that everyone can use it, the OS
packagers are going to have to make sure the code works with their
versions and might have to do porting work.

The first Con is the more important one for me.

>>  - setuptools has the narrow minded view of a python package being
>>    contained in a single directory, which doesn't fit well when you
>>    do have common locations for include or doc files. 
> 
> Python packages have no idea of "docs" or "includes", which is certainly
> a deficiency.
> 
+1.

I know I've mentioned paver before, but one of the things it does right
is making the declarative metadata extensible.  Whereas you can't simply
add a new piece of metadata to setup.py's setup(), you can add a new
Bunch() of metadata in a paver pavement.py file without any other code.
This makes it easy to do the right thing and write code that operates on
"docs", "includes", "locales", etc. that you've defined declaratively in
the metadata section.
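
For example, a minimal sketch of a pavement.py (the project name and the
extra sections are made up, but the pattern is paver's documented one):

    from paver.easy import options, Bunch

    options(
        setup=Bunch(
            name="frobnicator",        # hypothetical project
            version="0.1",
            packages=["frobnicator"],
        ),
        # Extra, tool-defined metadata is just another Bunch -- no new
        # plugin code needed to declare it:
        docs=Bunch(sourcedir="docs", builddir="docs/_build"),
        locales=Bunch(domain="frobnicator", podir="po"),
    )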

>> way packaging the python module with rpm or dpkg. E.g. namespace
>> packages are a consequence how setuptools distributes and installs
>> things. Why force this on everybody?
> 
> being able to break a large lump (say zope.*) into separate
> distributions is a good idea, which setuptools implements very badly
> using namespace packages...
> 
>> A big win could be a modularized setuptools where you are able to only
>> use the things you do want to use, e.g.
>>
>>  - version specifications (not just the heuristics shipped with
>>    setuptools).
> 
> not sure what you mean by this.
> 
I'm not 100% certain what Matthias means, but there are several problems
with setuptools' handling of versions:

1) The heuristic encourages bad practices.  Versions need to be parsed
by computer programs (package managers, scripts that maintain
repositories, etc.), and not all of those are written in Python.  Having
things other than numbers and dots in version strings is problematic for
these programs.  For instance, here's an ordering that setuptools'
versioning heuristics allow you to express:

foo-1.0rc1
foo-1.0
foo-1.0post1

But here's how rpm would order it:
foo-1.0
foo-1.0post1
foo-1.0rc1
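
You can see the setuptools-side ordering for yourself; a small sketch,
with the caveat that the exact heuristic depends on your setuptools
version:

    from pkg_resources import parse_version

    versions = ["1.0rc1", "1.0", "1.0post1"]
    print(sorted(versions, key=parse_version))
    # setuptools sorts these ['1.0rc1', '1.0', '1.0post1'], while rpm,
    # comparing the raw strings segment by segment, puts 1.0rc1 last.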

In Fedora we have rules for putting non-numeric things in our release
tag to work around this:

version: 1.0 , release: 0.1.rc1
version: 1.0 , release: 1
version: 1.0 , release: 2.post1

This is not all-inclusive, but as you can see, we have to move the
alphabetic portion of the version into the release tag to ensure that
the upgrade path moves forward sensibly.

2) This is more important but much harder.  Something that would really
help everyone is having a way of versioning the API/ABI.  Right now you
can specify that you depend on Foo >= 1.0, Foo <= 2.0.  But the version
numbers don't have meaning until the actual packages are released.  If
Foo-1.0 and Foo-1.1 don't have compatible APIs, your numbers are wrong.
If Foo-1.0 is succeeded by Foo-2.0 with the same API, your numbers are
too restrictive.  If you lock the versions to only what you've tested
(Foo == 1.0), then you're going to have people and distributions that
want to use the new version but can't.  Some places have good versioning
rules:
  https://svn.enthought.com/enthought/wiki/EnthoughtVersionNumbers

Other places say they have marketing departments that prevent that.  One
possibility would be to have MyLib1-1.0, MyLib2-1.0, MyLib2-2.0, etc.,
with the version for marketing included in the package name.

Another idea would be to have API information stored in metadata but not
in the package name.  That way marketing can have a big party for
MyLib-2.0 but the API metadata has API_Revision: 32.
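
Purely hypothetical sketch of what that could look like -- none of these
fields exist in any tool today, and the names are made up:

    metadata = {
        "name": "MyLib",
        "version": "2.0",            # what marketing announces
        "api_revision": 32,          # what code is actually written against
        "requires_api": {"OtherLib": (5, 7)},  # accept API revisions 5..7
    }

A depsolver could then match on api_revision instead of guessing what the
marketing version implies.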

>>  - specification of dependencies.
>>
>>  - resource management
> 
> ?
> 

http://peak.telecommunity.com/DevCenter/PythonEggs#accessing-package-resources

I have no love for how pkg_resources implements this (including the
API), but the idea of retrieving data files, locales, config files, etc.
from an API is good.  For packages to be written that conform to the
Filesystem Hierarchy Standard on Linux, the API (and metadata) needs to
be more flexible.  We need to be able to mark locale, config, and data
files in the metadata.  The build/install tool needs to be able to
install those into the filesystem in the proper places for a Linux
distro, an egg, etc., and then we need to be able to call an API to
retrieve the specific class of resources or a directory associated with
them.

Use cases:

* Config files go to /etc on Linux, and we'd want to retrieve the
contents of /etc/configfile.
* Generic, architecture-independent data files go under /usr/share/.
We'd want to place them in or under /usr/share/$PACKAGENAME.  Mostly
we're going to want to retrieve the contents of a specific data file.
* Locale files go under /usr/share/locale/ (e.g.
/usr/share/locale/en_US/LC_MESSAGES/compiz.mo).  We'll want to retrieve
the directory '/usr/share/locale' for feeding to gettext.
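
For reference, here's roughly what the existing pkg_resources API gives
you today (the package and resource names are made up):

    from pkg_resources import resource_string, resource_filename

    config_text = resource_string("mypkg", "data/mypkg.conf")
    locale_dir = resource_filename("mypkg", "locale")
    # gettext.bindtextdomain("mypkg", locale_dir) works from here, but
    # only if the files really live inside the package -- which is
    # exactly the limitation when a distro wants them under /etc and
    # /usr/share instead.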

>>  - a module system independent from any distribution specific stuff.
> 
> ?
I read this as "entry_points is a good feature".

> 
>>  - any other distribution specific stuff.
> 
> ?
> 
I think Matthias is trying to separate out the different services that
setuptools provides so that they can be decoupled and worked on
separately.  So "other distribution specific stuff" would be things to
do with distributing the results of your labors; eggs and PyPI would
fall under this.

Matthias, if I'm wrong on any of this, please correct me :-).  These are
my perceptions, since these are the issues I run into as a packager for
a different distribution.

-Toshio
