Re: Terrible performance of Python dependency generator

2015-11-23 Thread Elan Ruusamäe

On 22.11.2015 22:39, Jacek Konieczny wrote:

> /usr/lib/rpm/pythoneggs.py is used to find the dependencies and it is
> not that slow by itself… but it is called twice (Provides + Requires)
> for each file in /usr/share/pythonX.Y. And big Python packages have lots
> of files there. Most of them not adding any extra dependency
> information.

i once tried to hack around that, when facing a similar problem with the php
dependency generator.


the idea was simple:
1. the first time the dep generator is invoked, it becomes a daemon and starts
listening on a unix socket

2. further calls talk to the socket instead, dispatching the "requests"

but i never finished it, don't know if i have some WIP saved somewhere.
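
A minimal sketch of that idea, not the unfinished WIP itself: the long-lived
part is a thread here rather than a forked daemon, and fake_deps, DepHandler,
start_daemon and ask are hypothetical names made up for illustration.

```python
import os
import socket
import socketserver
import tempfile
import threading


def fake_deps(path):
    """Hypothetical stand-in for the real per-file dependency logic."""
    return "python(abi)" if path.endswith((".pyc", ".pyo")) else ""


class DepHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One request per connection: a single file name ending in a newline.
        path = self.rfile.readline().decode().rstrip("\n")
        self.wfile.write((fake_deps(path) + "\n").encode())


def start_daemon(sock_path):
    """Start the long-lived server (a thread here; a real helper would fork)."""
    server = socketserver.UnixStreamServer(sock_path, DepHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(sock_path, path):
    """What each later helper call would do: one cheap socket round trip."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall((path + "\n").encode())
        return client.makefile().readline().rstrip("\n")
```

The interpreter and its imports are then paid for once, and every further
"request" costs only a connect and two short writes.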

and with rpm 4.5 the python dep generator just compared the pythonX.Y
version in order to print "python(abi) %s"

that was "optimized" by providing PY_VER as env var.
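
One guess at what such a check could look like, as a sketch only: the
function name python_abi, the regular expression and the exact
"python(abi) = X.Y" output format are assumptions, not the rpm 4.5 code.

```python
import os
import re
import sys

# Matches the pythonX.Y path component, e.g. /usr/share/python2.7/...
_PY_DIR = re.compile(r"/python(\d+\.\d+)/")


def python_abi(path, py_ver=None):
    """Return 'python(abi) = X.Y' if the file lives under pythonX.Y, else None.

    py_ver defaults to the PY_VER environment variable, so the helper does
    not have to work out the version on every call (assumed behaviour).
    """
    if py_ver is None:
        py_ver = os.environ.get("PY_VER") or "%d.%d" % sys.version_info[:2]
    match = _PY_DIR.search(path)
    if match and match.group(1) == py_ver:
        return "python(abi) = %s" % py_ver
    return None
```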

--
glen

___
pld-devel-en mailing list
pld-devel-en@lists.pld-linux.org
http://lists.pld-linux.org/mailman/listinfo/pld-devel-en


Re: Terrible performance of Python dependency generator

2015-11-23 Thread Jeffrey Johnson

> On Nov 23, 2015, at 4:16 AM, Jacek Konieczny wrote:
> 
> On 2015-11-22 22:03, Jeffrey Johnson wrote:
>> Dependencies are automatically generated only for executable files.
> 
> That is not true for Python dependencies and this would not work for
> Python dependencies.
> 

(aside)
“Only executable files” SHOULD be true for all automated dependencies imho,
as that is what rpm dependencies were originally designed for, to verify that
executables had all necessary prerequisites. YMMV, everyone’s does.

> There are two useful types of Python dependencies:
> 
> 1. python(abi) – this is extracted from .pyc or .pyo files. These are
> not the executable scripts, but non-executable library files in /usr/lib
> or /usr/share. Checking a single *.py[co] file would do for the whole
> package. On the other hand, this dependency is a bit redundant, because
> files for each python abi are going to a different directory and the
> directory dependency should be enough.
> 
> 2. pythonegg(*) – these are extracted from meta-data in *.egg-info
> directories. A package usually contains only one such directory.
> 
> Currently all /usr/{lib*,share}/pythonX.Y/* files are passed
> to pythoneggs.py. Among these files there would be some *.pyc and some
> file from the egg-info directory, so all the important dependencies
> would be extracted.
> 
> Examining only the executables would return only the '/usr/bin/python',
> or even '/bin/sh' dependency.
> 
> I guess I will hack rpmfc.c to run the Python helper only for a single
> py[co] file and a single file in every egg-info directory.
> 

Whatever works for you …

>> So
>> using %files -f manifest, one can make a pass in %install to generate
>> the manifest, and doing both
>>  1) add a %attr marker to set the execute bits
>>  2) chmod -x on the file in %buildroot
>> 
>> and then generate dependencies manually (using a two pass build to
>> edit Requires: etc into the spec file).
> 
> Sounds like a very ugly hack.
> 

Yep.

> BTW we don't need a manifest to preserve proper file permissions as in
> PLD we _always_ provide permissions explicitly in %files. So we could
> just chmod -R a-x all the Python files. But that is not what file
> permissions are for!
> 

(aside)
There are other benefits to a manifest, particularly when filtering
large trees of files (which you surely have with drupal) to split
into sub packages. But you can package however you wish.

>> The better fix would be to use the embedded python interpreter to
>> avoid repeatedly involving a shell that invokes python.
> 
> That wouldn't work much better than not repeating a stupid check for each file.
> 

It's not the check, but the overhead of invoking python for every file, that
you are seeing.

>> But the fundamental problem is with user overridable external
>> helper scripts that conform to ancient expectations of the helper API
>> and still must classify files and generate cross referenced tag data
>> dynamically.
> 
> The 'ancient expectations of the helper API' actually made some sense in
> terms of performance (single process to handle a file list). Executing
> any external process for every file is plain stupid.
> 

Yes, the ancient API was dirt simple and was preserved. What becomes
problematic is that the metadata has changed so that the dependencies are
attached to each file in a package.

The original API was a single shell script … these days there are
too many types of dependencies to handle in one single shell script.

> And Python (and probably not only Python) dependencies are not per-file,
> but per python package. Linking dependency checks to specific files is
> quite artificial.
> 

We disagree here. There is functionality within rpm that disables dependencies
attached to a file when that file is excluded.

Of course you can put every file in its own package and choose not to
install that package to achieve the same effect.

But automatic dependencies are a file attribute carried in package metadata,
including pythonegg(…), not a package attribute imposed on the files within.

73 de Jeff
> Jacek


Re: Terrible performance of Python dependency generator

2015-11-23 Thread Jacek Konieczny

On 2015-11-22 22:03, Jeffrey Johnson wrote:

> Dependencies are automatically generated only for executable files.


That is not true for Python dependencies and this would not work for
Python dependencies.

There are two useful types of Python dependencies:

1. python(abi) – this is extracted from .pyc or .pyo files. These are
not the executable scripts, but non-executable library files in /usr/lib
or /usr/share. Checking a single *.py[co] file would do for the whole
package. On the other hand, this dependency is a bit redundant, because
files for each python abi are going to a different directory and the
directory dependency should be enough.

2. pythonegg(*) – these are extracted from meta-data in *.egg-info
directories. A package usually contains only one such directory.
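
As a rough illustration of the second kind: PKG-INFO uses an RFC 822 style
header format, so the standard email parser can read it. The sample metadata,
the function name egg_provides, the lower-casing and the exact
"pythonegg(name) = version" output are assumptions for this sketch, not a
copy of what pythoneggs.py does.

```python
from email.parser import Parser


def egg_provides(pkg_info_text):
    """Build a pythonegg(...) provide string from PKG-INFO text, or None."""
    meta = Parser().parsestr(pkg_info_text)
    name, version = meta["Name"], meta["Version"]
    if not name or not version:
        return None
    # Lower-casing the name is an assumption made for this example.
    return "pythonegg(%s) = %s" % (name.lower(), version)


# Made-up sample metadata, only for illustration.
SAMPLE = """\
Metadata-Version: 1.1
Name: Django
Version: 1.8.6
Summary: A high-level Python Web framework
"""
```

Calling egg_provides(SAMPLE) needs the text of one file per egg-info
directory, no matter how many other files the package ships.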

Currently all /usr/{lib*,share}/pythonX.Y/* files are passed
to pythoneggs.py. Among these files there would be some *.pyc and some
file from the egg-info directory, so all the important dependencies
would be extracted.

Examining only the executables would return only the '/usr/bin/python',
or even '/bin/sh' dependency.

I guess I will hack rpmfc.c to run the Python helper only for a single
py[co] file and a single file in every egg-info directory.
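
The selection itself could look roughly like this. It is a Python sketch of
the filtering logic only, not the planned rpmfc.c change; the function name
representative_files is made up, and it keeps one py[co] file for the whole
list, which assumes the package targets a single Python version.

```python
def representative_files(paths):
    """Keep one *.py[co] file overall and one file per *.egg-info directory."""
    chosen = []
    seen_bytecode = False
    seen_egg_dirs = set()
    for path in paths:
        parts = path.split("/")
        # Indexes of directory components that are egg-info directories.
        egg_idx = [i for i, p in enumerate(parts[:-1]) if p.endswith(".egg-info")]
        if egg_idx:
            egg_dir = "/".join(parts[:egg_idx[0] + 1])
            if egg_dir not in seen_egg_dirs:
                seen_egg_dirs.add(egg_dir)
                chosen.append(path)
        elif path.endswith((".pyc", ".pyo")) and not seen_bytecode:
            seen_bytecode = True
            chosen.append(path)
    return chosen
```

For a package with thousands of modules this would cut the helper runs down
to one per egg-info directory plus one.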


> So
> using %files -f manifest, one can make a pass in %install to generate
> the manifest, and doing both
> 1) add a %attr marker to set the execute bits
> 2) chmod -x on the file in %buildroot
>
> and then generate dependencies manually (using a two pass build to
> edit Requires: etc into the spec file).


Sounds like a very ugly hack.

BTW we don't need a manifest to preserve proper file permissions as in
PLD we _always_ provide permissions explicitly in %files. So we could
just chmod -R a-x all the Python files. But that is not what file
permissions are for!


> The better fix would be to use the embedded python interpreter to
> avoid repeatedly involving a shell that invokes python.


That wouldn't work much better than not repeating a stupid check for each file.


> But the fundamental problem is with user overridable external
> helper scripts that conform to ancient expectations of the helper API
> and still must classify files and generate cross referenced tag data
> dynamically.


The 'ancient expectations of the helper API' actually made some sense in
terms of performance (single process to handle a file list). Executing
any external process for every file is plain stupid.

And Python (and probably not only Python) dependencies are not per-file,
but per python package. Linking dependency checks to specific files is
quite artificial.

Jacek


Terrible performance of Python dependency generator

2015-11-22 Thread Jacek Konieczny
Hi,

We will probably need to rebuild the python-* packages again and I
already hate that. For example, python-django takes 45 minutes to build and most
of that is in the auto-dependency generator. That is insane! It should
not take that long!

/usr/lib/rpm/pythoneggs.py is used to find the dependencies and it is
not that slow by itself… but it is called twice (Provides + Requires)
for each file in /usr/share/pythonX.Y. And big Python packages have lots
of files there. Most of them not adding any extra dependency
information.

That is strange, as the dependency helpers accept a list of file names on
their stdin… and RPM (in lib/rpmfc.c) always feeds them with one
filename only. Why is that?
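
To make the difference concrete, here is a small sketch that assumes a helper
which reads file names on standard input, one per line. HELPER is a made-up
stand-in that just echoes a line per file, not pythoneggs.py, and the
function names are hypothetical too.

```python
import subprocess
import sys

# Hypothetical helper: prints one "dependency" line per file name on stdin.
HELPER = [sys.executable, "-c",
          "import sys\nfor line in sys.stdin: print('dep-for:' + line.strip())"]


def run_per_file(paths):
    """One helper process per file, as described for rpmfc above."""
    out = []
    for path in paths:
        result = subprocess.run(HELPER, input=path + "\n",
                                capture_output=True, text=True, check=True)
        out.extend(result.stdout.splitlines())
    return out


def run_batched(paths):
    """One helper process for the whole list, file names fed on stdin."""
    result = subprocess.run(HELPER, input="".join(p + "\n" for p in paths),
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Both return the same lines; the batched form starts the interpreter once
instead of once per file, and that start-up is where the time goes.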

I can even see a buffer for a file list in the code (iob_python in the
rpmfc_s struct), but it seems to be unused.

I tried to invent some smart hack to limit the number of files examined –
usually checking a single *.py file and the *.egg-info/PKG-INFO should
be enough, but I was not able to inject this in the weird rpmfc logic.
And I do not quite understand what it is supposed to do (what are those
'colors' and what files should be python-colored).

Can this be fixed somehow? How did we end up with this?

Jacek