Re: [pkg-discuss] [review] package performance improvements

Edward Pilatowicz Wed, 22 Feb 2012 18:28:40 -0800

On Wed, Feb 22, 2012 at 02:04:03PM -0800, Shawn Walker wrote:
> On 02/21/12 17:46, Edward Pilatowicz wrote:
> >On Wed, Feb 15, 2012 at 09:04:03AM -0800, Shawn Walker wrote:
> >>Greetings,
> >>
> >>The following webrev contains general changes to improve the overall
> >>performance of the package system:
> >>
> >>   7145683 explore general pkg performance improvements
> >>
> >>webrev:
> >>   https://cr.opensolaris.org/action/browse/pkg/swalker/pkg-7145863/webrev/
> >>
> >>
> >
> >i'm really impressed with all the performance you've squeezed out of
> >this code.  here are my comments:
> >
> >----------
> >general:
> ...
> >- in the .c files, since it takes me a while to remember how python c
> >   bindings work, it'd be nice to have some comments that say:
> >     in the case of an error we call "return (NULL)", which will
> >     generate an exception.  in the case of success we call
> >     Py_RETURN_NONE.
>
> That's not quite right though; the calling conventions differ
> depending on the routine and situation involved.
>
> Returning (NULL) by itself won't generate an exception.  If an
> exception is already set, or in other error scenarios, a caller is
> expected to return (NULL).  But again, even that isn't always true
> (for example, in module init).
>
> As for success, no, the value returned is dependent on the function.
> It's just that functions are expected in non-error scenarios to
> return *something* other than (NULL) (again this isn't always true,
> for things like module init).  In these particular cases, since
> something has to be returned, and None would be the normal default.
>


yeah.  i was more specifically thinking about the init routines where
this holds true.  (as mentioned more below, any comment hints or
reminders about how these functions should be working is always
appreciated.)

> >- i have a bit of concern that some of your performance gains could be
> >   slowly lost in the future due to innocuous looking code changes.  i
> >   wonder if perhaps the test suite should have an automated mechanism
> >   for running benchmarks and comparing them against a known baseline?
> >   (then perhaps these benchmarks could be run regularly via hudson for
> >   every putback).
>
> Last I heard, perfpit is already tracking our performance.  (They've
> certainly filed bugs on regressions in the past ...)
>
> This something we should discuss more generally for the project.  I
> share the same concerns.
>

gtn

> >----------
> >src/modules/actions/_actions.c
> >
> >- in general, i really prefer data driven code.  so for example, in both
> >   init_actions() and fromstr() you have hard coded all the action names.
> >   i'd would find it cleaner if there was one sorted array of action
> >   names, and that array was used in init_actions() to generate a table
> >   of pointers that could subsequently be used in fromstr().
>
> That sounds like it would be slower than what I'm doing now for
> little gain.  Our action types at this point change so infrequently
> that I'm not inclined to optimise for that sort of maintenance.
> Especially since the action string parsing code is now in a single
> source file.
>
> Is there some other benefit I'm missing to doing this?
>
> It certainly sounds more complex than what I'm doing now.
>
> I'll stop by to ask about this later...
>

wrt performance, you'd have to measure it to see if there's a
difference.  wrt benefit, i guess it's more a style thing.  i prefer
less code that's data driven, imo it makes testing and maintenance
easier.

> >- in init_actions(), why Py_DECREF() the error classes right after
> >   looking them up?  (seems you'd want to keep a reference to them since
> >   you still have globals pointing to them.
>
> So Python 2.x unfortunately has no module shutdown/cleanup hooks [1].
>
> Since I know that these objects won't be garbage collected until the
> actions module is unloaded, and since I can't DECREF them when
> _actions is unloaded, I simply DECREF them ahead of time so garbage
> collection on exit is clean.
>

please add a comment explaining this.

> >----------
> >src/modules/actions/_common.c
> >
> >- in set_invalid_action_error(), instead of calling:
> >
> >     PyObject_GetAttrString(pkg_actions,<class name>)
> >
> >   for each error class when you go to generate an error, i'd recommend
> >   looking up all the error error class in init_common() ahead of time.
> >   the reason is that these errors paths are infrequently used so moving
> >   some of this code to init_common() will make it easier to catch any
> >   errors in the code.  (which makes maintenance in the future easier as
> >   well.)
>
> I could see that except that we have test cases for every one of
> these errors (and should continue to), so I don't see any real
> benefit to looking them up ahead of time.  Certainly not concerned
> about the performance aspects for this path.  I'd still have to
> lookup the errors in a data structure or do an if/else if
> comparison, so that doesn't seems simpler either.
>

perhaps because it makes it more like the _generic.c module that looks
up globals ahead of time?  :)

> >- in set_invalid_action_error(), does the call to PyErr_SetObject() end
> >   up invoking a __init__() routine for the exc type object?  if so it'd
> >   be great to get some comments stating that.
>
> It would have to call __init__ to be able to construct an exception object.
>
> I'm uncertain of the utility of documenting the calling conventions
> of the standard CPython library
>
...
> With that said, I'll add a comment describing what this section is
> doing like this (above the call to PyObject_GetAttrString):
>

thanks.  i really haven't spent a lot of time writing CPython, and i'm
assuming that the folks who will end up maintaining this code will
probably fall into the same boat.  for this review i ended up repeatedly
looking up each of these function calls to try and verify what was going
on, so any extra context in the comments is greatly appreciated.

> >----------
> >src/modules/actions/generic.py
> >
> >- if "hardlinks must follow all filesystem-modifying actions", why are
> >   they followed by "link" actions.
> >
> >- user/group actions seem like filesystem modifying actions (since they
> >   update files) so why are they before "dir" actions?  (or perhaps i
> >   just don't understand the comment?)
>
> For the record; I didn't change the order, but I agree the comment or
> the order should match the other.
>
> The user/group actions must be before dir/file actions so we can
> chown the file/dir actions to the user/group that we might be
> delivering at the same time.
>
> Since I don't have a deathwish, I'm going to leave the order alone,
> but here's the new comment:
>
> # Directories must precede all filesystem object actions;
> # hardlinks must follow all filesystem object actions (except links).
> # Note that usr/group actions precede file actions; this implies that
> # /etc/group and /etc/passwd file ownership needs to be part of initial
> # contents of those files.
>

assuming i understand this now, could you update this to something like:

# Directories must precede all filesystem object actions;
# hardlinks must follow all filesystem object actions (except links).
# Note that usr/group actions precede file actions (so that we know
# about all the users and groups that other actions may be delivering
# content for); this implies that /etc/group and /etc/passwd file
# ownership needs to be part of initial contents of those files.

> >- in set_data(), the first thing it does is:
> >     if data is None:
> >   but _generic_init() won't invoke set_data() if data == None.
>
> Correct, _generic_init() doesn't bother calling set_data() if it
> doesn't need to as that would be much slower.  I've added a comment
> to _common.c:_generic_init.
>

you could also remove "if data is None:" from set_data() (or replace it
with an "assert data != None").

> >- in get_varcet_keys(), walking through self.attrs twice speeds things
> >   up?
>
> Yes, welcome to my world of Python performance craziness:
>
...
> Are you utterly terrified yet?  I've add a comment explaining why
> it's written that way.
>

absolutely.  (thanks for the comment.)
if only python had a -x0666 option...

> >- in consolidate_attrs(), i thought you removed the ability for
> >   attribute values to be sets?  (it seems you did in __str__().)
>
> Per the comments in other places, this ensures that the attribute
> values for the variants are converted to lists from sets.  (See
> comment in modules/flavor/base.py as an example).
>
> If you'd prefer that we never provide set objects for the attribute
> values so there's not even the slightest potential for a window
> where we could fail, I'm willing to change modules/flavor/base.py
> and modules/publish/dependencies.py to instead simply convert the
> attributes themselves ahead of time.
>

i'm still a bit confused.

in consolidate_attrs(), we used to check for property values that were a
list.  you updated this so that we check for lists or sets.  (implying
to me that attributes can be lists or sets.)

but in __str__(), we used to check if property values were a list or
sets, but you updated this so that now we only check for lists.  (and it
seems like __str__() will fail if it encounters a attribute which is a
set.)

this seems inconsistent to me.  from your comments it sounds like
attributes can indeed be sets, but does this mean there is some
undocumented convention that users must call consolidate_attrs() to
convert all the attributes to lists before using other functions like
__str__()?  (if this is the case perhaps there should be a local
"consolidated" variable within the action class so it can self
consolidate when necessary?)

> >----------
> >src/modules/catalog.py
> >
> >- in general, feel free to call me paranoid, but putting function calls
> >   in try blocks like this worries me:
> >
> >     try:
> >             yield f, self.__gen_actions(f, entry["actions"],
> >                 excludes)
> >     except KeyError:
> >             ...
> >
> >   the reason is that if any code in __gen_actions() generates a KeyError
> >   (which is a pretty common exception) we'll unexpectedly catch that
> >   here.  i'd much rather see something like:
> >
> >     try:
> >             actions = entry["actions"]:
> >     except KeyError:
> >             ...
> >     else:
> >             yield f, self.__gen_actions(f, actions, excludes)
>
> I'm not actually worried about this; this is why we have extensive
> test cases for catalog.py after all...
>
> Not only that, Python places a fair performance penalty on you if
> you use the else: clause.  (See earlier comments.)  At that point,
> I'd be better off reverting these changes completely.
>

boo (wrt python else performance).

> >----------
> >src/modules/client/imageplan.py
> >
> >- in __get_directories(), why does it matter in which order you update
> >   the "dirs" set?
>
> It doesn't matter in which order I update, but it does matter how
> many items I pass to update().  Constructing a set() object using a
> generator with potentially many items is faster than calling update
> with a set of many items that are needed to expand the set().
>
> So I call update() with the smaller, constant set of data.  Again,
> we're talking about milliseconds here 8)
>

how about adding a comment which says:

# it's faster to build a large set and make a small update to it than to
# do the reverse.

> >- why bother with:
> >     aname = act.name
> >     aattrs = act.attrs
> >     ...
> >             oattrs = oact.attrs
>
> See earlier example in get_varcet_keys() above; we avoid a few extra
> attribute retrievals since Python doesn't optimise this for us.  So
> often, assigning a reference to an attribute and then re-using that
> is faster.
>

in that case, there's another set of nested for loops in the same
function just below this one that seems like it could use the same
optimization for the same variables

> >----------
> >src/modules/flavor/base.py
> >src/modules/publish/dependencies.py
> >
> >- is there any reason not to always invoke consolidate_attrs implicitly
> >   from the generic Action __init__() method (or perhaps just the
> >   DependencyAction __init__() method)?
>
> Other than the substantially unnecessary overhead?  See my earlier
> explanation above.
>

as per my reply, i guess i'm still confused as to the purpose of this
consolidation...

ed
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss

Re: [pkg-discuss] [review] package performance improvements

Reply via email to