Hi Everyone,

I've been thinking about this question of the Drake implementation vs. the 
ThreadPool implementation and I wanted to share my thoughts. I had no idea the 
resulting email would be so long. It's my hope to offer interesting points for 
discussion.

These are all ordered by importance so you can bail when you like :)

Please bear with me...


What Should -j mean? (Part 1.)

There are two features for which I've made pull requests:

 1 - Limit the number of concurrent tasks executing.
 2 - All tasks process their prerequisites in parallel.

Both of these features are activated with separate flags: -j and -m, 
respectively. Neither feature requires the other. They are complementary.
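
To make the first feature concrete, here is a minimal sketch in plain Ruby of capping how many jobs run at once. It is not Rake's ThreadPool code: run_with_limit is a hypothetical helper written only for this example, and the eight sleeping jobs are stand-ins for real task bodies.

```ruby
# Hypothetical helper (not part of Rake): runs every job on at most `limit`
# worker threads and returns the highest number of jobs seen running at once.
def run_with_limit(jobs, limit)
  queue = Queue.new
  jobs.each { |job| queue << job }

  lock = Mutex.new
  running = 0
  peak = 0

  workers = Array.new(limit) do
    Thread.new do
      loop do
        job = begin
          queue.pop(true)      # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break                # no work left, so this worker exits
        end
        lock.synchronize do
          running += 1
          peak = [peak, running].max
        end
        job.call
        lock.synchronize { running -= 1 }
      end
    end
  end

  workers.each(&:join)
  peak
end

# Eight stand-in jobs, never more than three of them running at once.
jobs = Array.new(8) { proc { sleep 0.01 } }
peak = run_with_limit(jobs, 3)
```

The cap says nothing about which jobs are eligible to run together. That is the second feature, which is why the two can be switched on independently.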

Drake uses one flag to specify both features but there is no technical reason 
why Rake couldn't also activate both features with a single -j.

I raise this to separate the issue of "what -j means" from the possibly larger 
issue of the advantages of the drake implementation.


A Perk of the ThreadPool Implementation

I separate the two issues because the drake implementation is documented as 
breaking the existing contract exposed by the Rake API. From the drake page ( 
http://quix.github.com/rake/files/doc/parallel_rdoc.html ):

    Task#invoke inside Task#invoke
    
    Parallelizing tasks means surrendering control over the micro-management
    of their execution. Manually invoking tasks inside other tasks is rather
    contrary to this notion, throwing a monkey wrench into the system. An
    exception will be raised when this is attempted in -j mode.

The ThreadPool implementation does not share this same limitation or limit any 
features of the Rake API.

[A use case for this is below...]


What Should -j mean? (Part 2.)

As a Rakefile author, I have found a lot of utility in being able to 
incrementally parallelize my Rakefile. Allowing both task and multitask enables 
me to quickly activate parallelization for a section of my Rakefile. I like 
that if I've detected a parallelization bug, I can quickly fix it by simply 
removing the parallelization for that section, leaving the rest of the file to 
remain in parallel (which hopefully still maintains good performance). I've 
been grateful for those times when I can quickly fix the build by changing a 
multitask to a task.
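
As a small illustration of that one-word switch, here is a sketch using the Rake DSL. The task names (:compile_a, :compile_b, :objects) and the ran queue are made up for this example and do not come from a real build.

```ruby
require 'rake'
extend Rake::DSL   # harmless if the DSL is already available at top level

Rake::Task.clear   # start from an empty task list so the sketch stands alone
ran = Queue.new    # thread-safe record of which task bodies ran

task(:compile_a) { ran << :compile_a }
task(:compile_b) { ran << :compile_b }

# Parallel section: the prerequisites of a multitask are run concurrently.
multitask :objects => [:compile_a, :compile_b]

# If a parallelization bug shows up in this section, the fix is one word:
#   task :objects => [:compile_a, :compile_b]
# which runs the same prerequisites one after another.

Rake::Task[:objects].invoke

finished = []
finished << ran.pop until ran.empty?
```

Both declarations produce the same set of finished tasks. Only the scheduling of the prerequisites differs.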

Being able to choose between task and multitask has always seemed to me a 
gentler way to allow authors to parallelize their Rakefiles while retaining the 
power to really take advantage of the machine upon which it runs.

That's why I like the separation of the -m option.


Use Case For Task#invoke inside Task#invoke

Being able to call and activate tasks on the fly is also important to me 
because the build system at my job uses Task#invoke from within another 
Task#invoke. It's possible that I'm misusing Rake (and if so, this is a great 
opportunity for me to get a better solution from the community).

Here's how we use Task#invoke:

Our build system has a packaging component which creates a deployable "package" 
containing variations of the product, and a collection of global items used by 
all variations. For each product variation, there is a binary of the build with 
its corresponding symbol files.

Package
-------
- variations
  - debug
    - product.exe
    - product.pdb
  - release
    - ...
  - debug-only-feature-A
  - release-only-feature-B
  - etc...
- global-items
  - assets
  - manifest
  - etc...

We need to be able to specify at the rake command-line:
 - Which variations will be included
 - Overall options that affect every variation in the package
 
I tried to write a Rakefile that would take all those options and build a giant 
dependency tree. Inside an enumeration of the variations, I declared a :build 
task for the current variation. Each :build task was declared with a unique 
name based on its configuration, essentially creating a parametrized task 
(akin to C++ templates). The final :package task depended on all of them. Each 
variation's :build task depended on its own prerequisite task, and all of 
those depended on a single task, :preprocess_assets.

Here's pseudo-code:

  multitask :preprocess_assets => asset_tasks do |t,args|
    [code]
  end

  variations.each do |variation|
  
    task "build_prereq(#{variation.to_s})" => :preprocess_assets do |t,args|
      [code]
    end
  
    task "build(#{variation.to_s})" => "build_prereq(#{variation.to_s})" do |t,args|
      [use variation in build code]
    end
    
    task :package => "build(#{variation.to_s})"

  end

  task :package do |t,args|
   [packaging code]
  end
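
For anyone who wants to run something like this, here is a cut-down sketch of the same generated-name approach. The two variations, the label format, and the log array are assumptions made for the example. The real task bodies are not shown.

```ruby
require 'rake'
extend Rake::DSL   # harmless if the DSL is already available at top level

Rake::Task.clear   # start from an empty task list so the sketch stands alone
log = []           # records the order in which task bodies run

# Hypothetical variations; the real list had many more variables.
variations = [
  { conf: "release", features: "A,B" },
  { conf: "debug",   features: "A" }
]

task(:preprocess_assets) { log << "preprocess_assets" }

variations.each do |v|
  label = "conf=#{v[:conf]},features=#{v[:features]}"

  task("build_prereq(#{label})" => :preprocess_assets) do
    log << "build_prereq(#{label})"
  end

  task("build(#{label})" => "build_prereq(#{label})") do
    log << "build(#{label})"
  end

  # Each pass through the loop adds one more prerequisite to :package.
  task :package => "build(#{label})"
end

task(:package) { log << "package" }

Rake::Task[:package].invoke
```

Every task in the tree runs at most once, so :preprocess_assets is shared by all variations without any extra bookkeeping.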

Here's an ascii diagram (note that there were many more variables than "conf" 
and "features"):

  [asset,asset,...] <-- (in parallel)
           |
   :preprocess_assets
           |
           +-- "build_prereq(conf=release,features=A,B)"
           |       --> "build(conf=release,features=A,B)" ---.
           +-- "build_prereq(conf=debug,features=A)"         |
           |       --> "build(conf=debug,features=A)" -------|
           +-- "build_prereq(conf=debug,features=A,B)"       |
           |       --> "build(conf=debug,features=A,B)" -----|
           +-- "build_prereq(conf=release,features=B)"       |
                   --> "build(conf=release,features=B)" -----|
                                                             |
                                                         :package


It seemed very straightforward, but it was difficult to read and debug the 
Rakefile. All the task names were generated (making them hard to find in the 
code when referenced from rake output) and the tree was very large.

Using Task#invoke allowed me to get rid of all the parameterization and create 
a Rakefile that better matched the flow of the process and was simpler to read.

  multitask :preprocess_assets => asset_tasks do |t,args|
    [code]
  end

  task :build_prereq, [:conf, :features] => :preprocess_assets do |t,args|
    [code]
  end
  
  task :build, [:conf, :features] => :build_prereq do |t,args|
    [use args]
  end
  
  task :package do |t,args|
  
    variations.each do |variation|
      Rake::Task[:build].invoke(*variation)
      [reenable :build and its prerequisites]
    end
  
    [packaging code]
  end
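
Here is a runnable sketch of that loop, assuming two made-up variations and a built array that stands in for the real build code. The reenable calls are spelled out for :build and :build_prereq. Without them, the second invoke would be a no-op because Rake remembers that the task has already run.

```ruby
require 'rake'
extend Rake::DSL   # harmless if the DSL is already available at top level

Rake::Task.clear   # start from an empty task list so the sketch stands alone
built = []         # stand-in for the real build output
prepared = []      # stand-in for the real prerequisite work

# Hypothetical variations, each a [conf, features] pair.
variations = [["release", "A,B"], ["debug", "A"]]

task :build_prereq, [:conf, :features] do |t, args|
  prepared << [args[:conf], args[:features]]
end

task :build, [:conf, :features] => :build_prereq do |t, args|
  built << [args[:conf], args[:features]]
end

task :package do
  variations.each do |variation|
    Rake::Task[:build].invoke(*variation)
    # Let the next variation run the same tasks again.
    Rake::Task[:build].reenable
    Rake::Task[:build_prereq].reenable
  end
  # packaging code would go here
end

Rake::Task[:package].invoke
```

The arguments passed to :build reach :build_prereq as well because both tasks declare the same argument names.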


Here's an ascii diagram

    [asset,...] <-- (in parallel)
         |
  :preprocess_assets
         |
    :build_prereq
         |
       :build   <--loops over-- :package



Keeping Rake Flexible

On a more general note, Rake has always been presented to me as an API for 
dependency-based programming, with the DSL as a (significant) perk that lets 
you write a dependency tree in a declarative style. But as far as I know, 
there has never been a formal boxing of the Rake system into a "declare tasks" 
mode and an "execute tasks" mode, which it seems the drake implementation 
encourages, if not requires.


Thank you for making it this far. I look forward to the discussion generated by 
these points.

Sincerely,

_ michael bishop





On Oct 23, 2012, at 12:18 PM, Jim Weirich <[email protected]> wrote:

> 
> On Oct 22, 2012, at 4:04 PM, Hongli Lai <[email protected]> wrote:
> 
>> Conservative is one thing, but drake was written 2 years ago. There has been 
>> no response every time someone asks why drake was not merged.
> 
> My main problem with drake is that it adds a second task execution engine 
> that is subtly different from the mainline rake engine.  The difference isn't 
> critical and most projects won't even notice the difference, but having two 
> similar but different engines offends my sensibilities.
> 
> If drake were to be merged, I would want to either (a) discard the current 
> engine and use drake's engine exclusively, or (b) make the parallelization 
> mechanism work more closely with the current rake engine.
> 
> I know drake uses a dry-run pass to compute the dependency tree, but I'm not 
> sure if the dry run pass uses the regular rake engine (which might impact 
> option (a)) or if it does its own thing.
> 
> In any case, a drake merge won't happen in the 0.9.x series as I would like 
> to work out the current bug list and hit some simple features.  The Thread 
> pool looked like an easy win and is really needed for the multitask stuff 
> anyways. Michael has also proposed a -m option that implicitly turns tasks 
> into multitasks, and I'm considering that instead of a drake integration.
> 
> However, if the -m flag is deemed inadequate, I will probably hold off on the 
> thread pool as well and reconsider a drake move a bit farther down the line.
> 
> Thoughts are welcome.
> 
> (Postscript: I also have some concerns about turning on parallel execution in 
> arbitrary Rakefiles.  I suspect it will work fine in projects that mostly 
> shell out to compilers and linkers, but Rakefiles that run mostly Ruby code will 
> probably be broken in ways that are hard to detect and reproduce. If anyone 
> has any ideas on addressing that issue, I would love to hear them.)
> 
> -- 
> -- Jim Weirich
> -- [email protected]
> 
> 
> 
> 
> 
> _______________________________________________
> Rake-devel mailing list
> [email protected]
> http://rubyforge.org/mailman/listinfo/rake-devel
