Re: The Gump3 branch

Leo Simons Sun, 09 Jan 2005 06:24:49 -0800

On 08-01-2005 15:21, "Adam R. B. Jack" <[EMAIL PROTECTED]> wrote:
>> Phew, have I been busy :-D.
> 
> You certainly have.


Ooh, long e-mail! I'm gonna try and split this up... :-D

Inter-component-communication
-----------------------------
> I'm sure your IOC/container experiences have required you to answer this
> before, but how do you allow components to communicate/collaborate?

I firmly believe there is very little need for different components to
communicate. If you architect things the IOC way, components will use just
one or two other components, and their parent can just set up the references
between all those components.

What will happen is that a component needs a certain kind of result
available. For example, something that pushes information in the dynagump
database needs that information, which might be put there by an ant builder
or something like that. This kind of stuff is trivial in python; you just
set the property on the relevant part of the model and then retrieve it
later.

Note that such communication is pretty indirect. For example the start of
the CvsUpdater plugin I did just pushes information into the model (the log
of the cvs command, exit status, etc) without worrying who uses that
information (at the moment, it is just ignored).
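
A minimal sketch of that kind of indirect communication. The class and
attribute names below are invented for illustration and are not the actual
Gump3 model or plugin API:

```python
# Hypothetical names throughout; this is not the actual Gump3 model or plugin API.
class Module:
    """A bare-bones piece of the model."""
    def __init__(self, name):
        self.name = name


class CvsUpdater:
    """Pushes its results onto the model without caring who reads them."""
    def visit_module(self, module):
        # A real updater would run cvs here; these values are stand-ins.
        module.update_log = "cvs update: nothing changed"
        module.update_exit_status = 0


class ResultsReporter:
    """Reads whatever earlier plugins left on the model, if anything."""
    def visit_module(self, module):
        status = getattr(module, "update_exit_status", None)
        if status is None:
            return "%s: no update information" % module.name
        return "%s: update exited with %d" % (module.name, status)


module = Module("ant")
CvsUpdater().visit_module(module)
report = ResultsReporter().visit_module(module)
print(report)
```

Neither plugin holds a reference to the other; the model is the only thing
they share.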

> There
> were times when building logic wanted to know something historically (had
> this built before, etc.) in order to determine how much effort (or what
> switches) to use. Is inter-component communications like this a real no-no,
> or is this something that might be "coincidentally" allowed via steps in
> pre-processing, etc.

We don't need "steps". Think unix command line utilities. You can make them
communicate:

  find . -type f | grep -v ".svn"

Without steps. That "|" there in gump is achieved by setting a property on a
piece of the model.

Threading
---------
> Do you think we have a chance to re-instate threading in this model? [It is
> a minor nit, not a show stopper, but I liked the large run-time reduction of
> concurrent checkouts.]

Yes. We can probably reuse the worker code from gump2. I left it out on
purpose because it was clouding the gump2 code (several of the gump2 bits
all worry about multi-threading) and making it difficult to read.

What you can do for example is multithread each of the three stages, then
join the threads in between. And each plugin might do multithreading on its
own. 
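
Roughly like this, as a sketch only: it is not the gump2 worker code, the
stage and project names are made up, and one thread per item is a
simplification that a real run would probably replace with a bounded pool:

```python
import threading

# Hypothetical stage and item names; a sketch, not the gump2 worker code.
def run_stage(stage_name, items, action, results):
    """Run action(item) for every item on its own thread, then join them all."""
    threads = []
    for item in items:
        def work(item=item):
            results.append((stage_name, action(item)))
        thread = threading.Thread(target=work)
        threads.append(thread)
        thread.start()
    # Joining here means the next stage only starts once this one is done.
    for thread in threads:
        thread.join()


results = []
projects = ["ant", "xerces", "log4j"]
for stage in ("preprocess", "build", "postprocess"):
    run_stage(stage, projects, lambda name: name.upper(), results)

stages_in_order = [stage for stage, _ in results]
print(stages_in_order)
```

Within a stage the items finish in any order, but the join keeps the stages
themselves strictly sequential.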

What I want to see first is where we need it. Instrument the different bits
of the build and find out where we need the speedups. Keep most of the code
simple! :-D

CLI
---
>> I've gotten the Gump3 branch into a state where
>> everything "works" (for me), as far as stuff is implemented. The main
>> "core" thing that is missing is cyclic dependency detection. I've got the
>> right algorithm written down on paper, just need to make it happen. The
>> hooks for it are there already though (the gump.engine.modeller.Verifier
>> class).
> 
> Mind pumping a few command lines up to a wiki or somewhere? I'd like to run
> the engine, and unit tests, and such. Gump2 was a pain to run (we never
> cured its confusion) and I'd like to start comfortably with Gump3 from the
> get go.

Uhm, yeah, I do :-D. The interface should be so easy to use you don't need
the docs. Try "./gump help" for starters. There's work TODO here, but I
really prefer to update the code rather than the wiki!

> One thought in that regard is "partial runs". I think Gump2 was believed
> (although not actually true) to be less "incremental build" friendly since
> it wouldn't allow one to do "build X", "update X". [It was there in Gump2,
> just the command line was so crude folks never got to use it.] I feel we
> need Gump3 to be easy to run in pieces, and in parts.

I disagree, actually! The reason we needed to do stuff like that was because
gump is so complex and difficult to use that one resorts to a model of
"let's try this and see if it works". We need to fix gump so that you don't
need to do that. IE, make it easy to write correct metadata.

I would like to make the "hacky" bits like this not part of the core. If you
need an adjusted profile with just a few projects, then change the profile!

> Easily asking for
> things that include/exclude components on the fly. Nicola's (and Sam's)
> wxPython GUI was a nice "user" this way. Any thoughts on re-instating that?

I'm not against GUIs, but I feel CLI is way more important to get right
first.

Plugins
------- 
>  I think that generating plug-ins (perhaps even for loading, and such) is
> key. I'm not sure (yet) if the new model is any better than the old in
> allowing the "core steps" (loading, modelling) to be plugged in, but I think
> it needs to be investigated.

Yes, it's easy. Change the get_verifier() in config.py to provide a different
implementation, and that's it!
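
As a sketch of the idea: only the get_verifier() name comes from the text
above. The two verifier classes, the verify() method, the "strict" parameter
and the plain depth-first cycle check are assumptions for illustration, not
the algorithm planned for Gump3:

```python
# Hypothetical stand-in for config.py; everything except the
# get_verifier() hook name is invented for illustration.
class NullVerifier:
    """Accepts any model."""
    def verify(self, dependencies):
        return True


class CycleVerifier:
    """Rejects a dependency graph that contains a cycle (depth-first search)."""
    def verify(self, dependencies):
        visiting, done = set(), set()

        def has_cycle(node):
            if node in done:
                return False
            if node in visiting:
                return True  # came back to a node still on the current path
            visiting.add(node)
            found = any(has_cycle(dep) for dep in dependencies.get(node, ()))
            visiting.discard(node)
            done.add(node)
            return found

        return not any(has_cycle(node) for node in dependencies)


def get_verifier(strict=True):
    """The one place to edit when a different implementation is wanted."""
    return CycleVerifier() if strict else NullVerifier()


ok = {"a": ["b"], "b": ["c"], "c": []}
bad = {"a": ["b"], "b": ["a"]}
print(get_verifier().verify(ok), get_verifier().verify(bad))
```

The rest of the engine only ever calls get_verifier(), so swapping the
implementation touches one function.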

> I see you have a Maven parser, but could/should
> that be a plug-in?

I doubt we should be talking about this kind of stuff as a "plugin". There's
very specific bits of functionality that *need* to be performed (right
"contracts") for gump to work. To me, a plugin is something you can leave
out and still have something that basically works.

(and note there's no maven parser, just a hook to wire it in :-D)

> If you can leverage the framework here, perhaps in
> multi-stage runs (e.g. pre/run/post for loading metadata, pre/run/post for
> building, etc.) that might be nice. [Not sure if it is overkill, but I think
> it was a big weakness in Gump2 that needs to be addressed.]

I think it's overkill. Right now we have 3 steps divided up into an event per
project, module, or repository. Ie that's 3000 distinct events on which a
plugin can decide to perform or not perform an action. I thought about
making it "n" steps (which would be easy to do in terms of code), but that
will just make the code hard to understand conceptually.

Maybe it has to be 2 steps, or 5. But 9 (3*3) seems like a lot.
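
To make the "stages times elements" arithmetic concrete, here is a small
sketch. The visit_* method names, the Plugin base class and the tiny model
are hypothetical, not the Gump3 plugin contract:

```python
# Hypothetical names; a sketch of "3 stages, one event per model element".
class Plugin:
    """Default plugin: ignore every event."""
    def visit_repository(self, repository):
        pass

    def visit_module(self, module):
        pass

    def visit_project(self, project):
        pass


class ProjectCounter(Plugin):
    """Only acts on project events and ignores the rest."""
    def __init__(self):
        self.seen = []

    def visit_project(self, project):
        self.seen.append(project)


def run(stages, model):
    """Walk the stages in order, firing one event per element per stage."""
    events = 0
    for plugins in stages:            # preprocess, build, postprocess
        for kind, elements in model:  # ("repository", [...]), ...
            for element in elements:
                events += 1
                for plugin in plugins:
                    getattr(plugin, "visit_" + kind)(element)
    return events


counter = ProjectCounter()
model = [("repository", ["cvs.apache.org"]),
         ("module", ["ant", "xml-commons"]),
         ("project", ["ant", "xml-apis", "resolver"])]
stages = [[Plugin()], [counter], [Plugin()]]
events = run(stages, model)
print(events, counter.seen)
```

Six model elements and three stages give 18 events here; a plugin opts in
simply by overriding the visit method it cares about.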

Memory use
----------
> Another weakness of Gump2 was the (eventually huge) in-memory trees
> combining model and results. Hmm, I'm not sure if this goes away here (or
> not), and I fear not. How are we going to allow (say) a results plug-in to
> inject the build log (and/or commandline or whatever) into the results DB? I
> suspect it needs to reach out and touch the memory structures. Maybe little
> has changed here.

If we have 20k of result data (ie output of logs, which is a lot) for each
component and module, and there are about a thousand of those, that's about
20000k of data in the model. I don't think that's too much.

It's possible to put a RDBMS in between (I thought about doing that,
actually, it shouldn't be too hard) or better yet a Berkeley DB, but I doubt
we really need to. It would be much easier to write just the logs to a
database and add a method on the model that fetches it. Ie the cvs logger
could do

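 # assumes self.db is a DB-API style connection (e.g. sqlite3) with execute()
 id = project.model.name + ":" + project.name + ":" + project.startdate + \
      ":cvs-up-log"
 self.db.execute("INSERT INTO logs (id, log) VALUES (?, ?)", (id, log))

 # id and db are bound as defaults; there is no 'self' argument because a
 # function attached to an instance is called without one
 def getCvsLog(id=id, db=self.db):
   return db.execute("SELECT log FROM logs WHERE id = ?", (id,)).fetchone()[0]
 project.getCvsLog = getCvsLog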

And a results plugin could do

 cvslog = project.getCvsLog()

And similar for the other large-amount-of-data plugins. You'd end up with a
much smaller in-memory model. But I want to be sure we need that before we
do it, and I really want to keep the ease of use of the in-memory model.

> [I half wondered about using XML file between components
> so we could completely run build and later run results generation. I never got
> to it 'cos I felt it was a lot of work and maybe overkill. Thoughts?] [Hmm,
> do we need a Wiki page w/ re-design goals/objectives to measure this
> framework against?]

Overkill. Let's do incremental design and let the requirements fall out by
themselves!

Architecture
------------
> I think we need to treat internal plug-ins the same as community added, i.e.
> eat our own dog food. Do you know Python patterns for discovering and
> loading such plug-ins? I'd like to start by writing plug-ins that this
> framework can run. Is (say) an RDF generating plug-in "missing the point" of
> DynaGump, or something allowable? I'm game to start work on the DB interface
> for generating history, or others.

I really think we don't need "loading and discovery". Think small. Just edit
the config file directly. I really don't want to think about what is
"allowable" in terms of plugins. Just write lots of them if you feel like
it, and we can choose which to enable and disable based on whether they're
used.
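
For instance, something along these lines in the config file. This is a
sketch: get_plugins() and the plugin class names are hypothetical stand-ins,
not what is in the branch:

```python
# Hypothetical config.py fragment; function and plugin class names are invented.
class CvsUpdater:
    pass


class SvnUpdater:
    pass


class AntBuilder:
    pass


class RdfReporter:
    pass


def get_plugins():
    """Enable or disable a plugin by editing this list; no discovery needed."""
    return [
        CvsUpdater(),
        SvnUpdater(),
        AntBuilder(),
        # RdfReporter(),  # disabled until somebody actually uses it
    ]


enabled = [type(plugin).__name__ for plugin in get_plugins()]
print(enabled)
```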

That said, cocoon is all about XML so it's probably a lot easier to do RDF
generation using a cocoon generator in dynagump!

>> The other stuff that's missing is a lot of plugins. The new architecture
>> as I set things up identifies three stages:
>> 
>> - preprocessing
>> - build/run
>> - postprocessing
> 
> Is this a tried and tested model you've used a lot in containers? Just
> curious about its origins.

Nah, it's just how I perceived gump2 to work:

1) load stuff
2) update stuff
   - "updater plugins" (cvs,svn,perforce)
3) build stuff
   - "builder plugins" (ant,maven,make)
4) actors to deal with that stuff
   - "results plugins" (rdf,html,xdocs)

(1) is a core concern and really needs to happen first. 2,3,4 probably need
to be easier to change, and have been rationalised into a single plugin
model. In Gump2 code it said "probably need to make these into actors" about
the update and the build code in some comment. So that's what I did. Then I
thought about that a little more and figured we still needed some stages to
keep things easy to understand.

> I wonder if (eventually) we'd like to be able to break
> Gump3 completely from the sequential "run", perhaps into an event-based
> engine.

That's totally doable, but it's not how we think of our workflow processes.
As a programmer, you're used to steps. Update from cvs, change code, save,
run tests, generate builds, run integration tests, review code, add
comments, run tests again, commit. Making gump perform the same steps is a
good idea because it's familiar.

>> And each of those can have plugins (basically what are now called actors).
>> Preprocessing plugins that need to be built include source repository
>> updaters. Build tools that need to be built include all the handlers for
>> the different Commands. Postprocessing that needs to be built include the
>> dynagump adapter. Basically everything :-D.
> 
> Hopefully we can snarf some of the existing code and re-work it. The nice
> thing about DynaGump is we don't need to copy over some of the more
> monstrous, or historical (DBM stats) code. It oughtn't take us too long.

Sure thing! What I want to be totally sure about is that copying code over
doesn't dilute the codebase. Everything we can bring in to rework as simple
plugins is not a problem in that respect, because you can just throw a
plugin out and rewrite one from scratch.

But plugins need to be small and focussed. Like there's a plugin for cvs
updates, and one for svn updates. There's a plugin for ant builds. One to do
the <mkdir/> commands. Etc etc.

And we need to be real critical about what we need and don't need. I think
gump2 has shown us there's a lot of stuff in gump1 that we just don't need.
It was duplicated in gump2 then disabled. And we've also identified some
things we really need that aren't fully there yet (system totally built for
interacting with a range of different commands, a full history of the gump
tree and its transformations over time).

Why do we need RDF? Why do we need DBM? Let's answer those questions for
every bit of code we bring in.

Writing plugins
---------------
>> Those people that know the current pygump should no doubt see how so many
>> things are similar. However, there's also a lot of differences, small and
>> big. The big ones IMHO are:
>> 
>>  * components. Most stuff is implemented as a component. Dependencies and
>>    config is passed into components through __init__. This leads to
> 
> Any support for "cascading decisions". I.e. If we find a DB is available
> don't install a DBM stats driver (hopefully we'll throw that away, this is
> just an example) but use a MySQL one. Given that, and given historical DB,

Decisions like that can be made in config.py. Or you break out a support
class that is used from config.py if the file gets too big. This is so easy
to do in python it's silly:

 def get_db(log):
   try:
     from gump.util.mysql import Database
     return Database()
   except ImportError:
     log.info("MySQL unavailable, default to DBM...")
     from gump.util.dbm import Database
     return Database()

>>  * Testable code. Though I haven't started writing tests yet, that will be
>>    easy now. For example, there's but two places in gump where we write to
>>    files (the bootstrap logger and the VFS layer) directly. Swapping in
>>    place a StringIO buffer is trivial.
> 
> I like this goal. Gump had lots of tests, but its core run/engine let it
> down. I ended up using small test workspaces to do testing, and that was ok,
> but I'd like to see better.

Those tests really acted like integration tests. Basically integration tests
to test the integration test (which gump is, after all). Small components,
small unit tests. Unit tests are much easier to write if you plan for them
:-D
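
A small sketch of what "plan for them" can look like. The BootstrapLogger
class here is a made-up stand-in, not the real bootstrap logger; the point is
only that the output stream comes in through __init__:

```python
import io

# Hypothetical component; the point is that the stream comes in via __init__.
class BootstrapLogger:
    def __init__(self, stream):
        self.stream = stream

    def info(self, message):
        self.stream.write("INFO: %s\n" % message)


# In a unit test, hand it an in-memory buffer instead of a real file.
buffer = io.StringIO()
BootstrapLogger(buffer).info("starting run")
print(repr(buffer.getvalue()))
```

Production code passes an open file, the test passes a buffer, and the
component cannot tell the difference.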

>>  * model simplification. The gump2 gump.model package is huge and kind-of
>>    difficult to grasp. All the difficult bits are now elsewhere. This
>>    makes it a lot easier to write plugins.
> 
> Not sure I see this yet, but I'd like to understand it more.

Just take a look at the gump/model/__init__.py file. Note how small it is
compared to all the lines of code in the gump2 model. Stuff like the
complete() functionality in gump2 needs to be elsewhere.

The Road Ahead (:D)
-------------------
>>  * it doesn't work! Gump2 actually runs, and runs relatively quickly
>>    (being multi-threaded and all). Running Gump3 right now only results in
>>    some logging output.
> 
> Ah, kinda like the code that Sam left when he started Gump2. :-) I forget
> how many hours I spent learning Python, understanding Gump's model
> interactions/complexities, and fighting silly things in Python (like not
> being able to time out a spawned process) in order to get it working. That
> said, we have all that in SVN we can leverage here. The smart choices of
> splitting work off for DynaGump is key to making Gump3 a lot simpler.

Exactly! I spent the last two weeks mostly looking at code in gump2 and
extracting what felt like the "core" bits. The first gump had absolutely
horrible architecture (XSLT for generating shell scripts, ugh), then the
python one got real about being serious software and along the way figured
out the basics. We can now leverage that and build something clean starting
from those basics.

>> So what we have in Gump3 right now is three empty shells (a blank
>> "pygump", a blank "dynagump" and a blank database) that should be able to
>> do just about anything after they're filled with the right components.
>> 
>> What I would like now is a beer and some feedback :-D
> 
> You've earned your beer, enjoy it. I hope I can find time to work with you
> to help flesh this out. I think there is a ways to go in making this
> theoretical model handle all we might want it to for Gumping, but I'm
> interested in trying. I think that a nice simple/clean framework like this,
> combined with a whole lot of throwing out the old, (and only bringing it
> over if it is valuable) is going to work wonders for helping Gump become what
> it can be, and being maintainable.

I'm real glad you feel the same way! We're all pressed for time I guess, but
I hope some of the hard bits are done so we can get to work on stuff in a
more piecemeal way...

> Thanks Leo. Good job. [and now my mind is racing w/ thoughts around this,
> thanks for waking me up! I hope I don't cut a finger off w/ the jaws 'cos
> I'm distracted. ;-)]

Hehehe. Do let us know you're alright dude!

Final thought (for now): let's not get carried away talking about frameworks
and component architectures and inter-component communication and
model-driven architecture too much. A dynamic language like python enables
you to *not* have to think about a lot of those things. As long as you
follow the KISS principle for every line of code you write and partition the
code into really small chunks you should be fine.

What I hope to have figured out is a way (it probably isn't the best way,
but hey, it's what I know) to make it easier to write small chunks of gump.
In terms of architectural decisions, that's probably saying something like
"Hey, dude! Let's build a house using small bricks! But how? We'll glue 'em
together using concrete!", and that's really all we need.

Cheers,

- LSD



