Re: MNG-3004/MNG-2802 - Achieving massive parallelity ?

Ralph Goers Sun, 22 Nov 2009 18:43:35 -0800

One downside to this kind of parallelism that you should consider is the 
output. If all these tasks are writing to stdout or stderr simultaneously it is 
going to make the build output hard to understand.  It might be preferable to 
pipe the output from a thread into a cache and then write the whole cache at 
once while locked to stdout.
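
A minimal sketch of that idea, assuming a hypothetical helper class (BufferedTaskOutput is not part of Maven): each task prints to a private in-memory cache, and the cache is copied to the shared stream in one write while holding a lock on that stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not a Maven class.
class BufferedTaskOutput {
    interface Task {
        void run(PrintStream out);
    }

    private final PrintStream shared;

    BufferedTaskOutput(PrintStream shared) {
        this.shared = shared;
    }

    // Runs the task against a private cache, then writes the whole cache
    // to the shared stream while holding the lock on that stream.
    void runAndFlush(Task task) {
        ByteArrayOutputStream cache = new ByteArrayOutputStream();
        PrintStream local = new PrintStream(cache, true);
        try {
            task.run(local);
        } finally {
            local.flush();
            byte[] bytes = cache.toByteArray();
            synchronized (shared) {
                shared.write(bytes, 0, bytes.length);
                shared.flush();
            }
        }
    }

    // Runs `modules` fake builds on a fixed pool and returns the combined output.
    static String demo(int modules, int linesPerModule) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedTaskOutput output = new BufferedTaskOutput(new PrintStream(sink, true));
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int m = 0; m < modules; m++) {
                final String name = "module-" + m;
                futures.add(pool.submit(() -> output.runAndFlush(out -> {
                    for (int i = 0; i < linesPerModule; i++) {
                        out.println("[" + name + "] line " + i);
                    }
                })));
            }
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo(3, 3));
    }
}
```

The order in which modules appear still depends on which task finishes first, but the lines of one module stay together. The cost is that nothing from a task is visible until it completes, and a long task holds its whole log in memory.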


Ralph

On Nov 22, 2009, at 12:06 PM, Dan Fabulich wrote:

> I like it!
> 
> Well, except for the "1 thread per module" part; that's clearly too many 
> threads.  You'd want a fixed thread pool.
> 
> But restructuring the multithreading around the individual phases and 
> *scheduling* phases from later projects when earlier project phases are done 
> seems workable.
> 
> We probably would have thought of this earlier (or, at least, *I* would have) 
> if the default reactor behavior worked like that.
> 
> Today, "mvn install" will first compile, test, and install project A, then 
> compile, test, and install project B, and then compile, test and install 
> project C.
> 
> But even without multithreading, you could imagine the reactor compiling A, 
> then compiling B, then compiling C, then testing A, then testing B, then 
> testing C, and finally installing A, installing B, and installing C. This 
> strategy might fail faster than today's project-by-project strategy.
> 
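
To make the two orderings concrete, here is a small sketch that only computes the order of steps and runs nothing. The class and method names are made up for illustration, and the projects are assumed to already be in reactor (dependency) order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: these names do not exist in Maven.
class ReactorOrder {
    // Today's behavior: run every phase of one project, then the next project.
    static List<String> projectByProject(List<String> projects, List<String> phases) {
        List<String> steps = new ArrayList<>();
        for (String project : projects) {
            for (String phase : phases) {
                steps.add(phase + " " + project);
            }
        }
        return steps;
    }

    // Proposed behavior: run one phase across all projects, then the next phase.
    static List<String> phaseByPhase(List<String> projects, List<String> phases) {
        List<String> steps = new ArrayList<>();
        for (String phase : phases) {
            for (String project : projects) {
                steps.add(phase + " " + project);
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> projects = Arrays.asList("A", "B", "C");
        List<String> phases = Arrays.asList("compile", "test", "install");
        // [compile A, test A, install A, compile B, test B, install B, ...]
        System.out.println(projectByProject(projects, phases));
        // [compile A, compile B, compile C, test A, test B, test C, ...]
        System.out.println(phaseByPhase(projects, phases));
    }
}
```

With the second ordering a compile error in C shows up before any tests have run, which is where the faster failure would come from.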
> I propose that we first implement this as an optional reactor strategy (via a 
> special command-line argument) in singlethreaded mode, and work out all the 
> kinks.  Once we're pretty happy with that, we can add support for it in 
> multithreaded mode.  It's especially important that it be at least *possible* 
> to run it in singlethreaded mode, in case it causes problems for some 
> projects independently of multithreading.
> 
> BTW, what would we call this new mode?  Perhaps we'd call it "weave" mode, 
> because we're going across the projects horizontally in lifecycle order. I 
> initially thought we might call it "breadth-first" as opposed to 
> "depth-first," but that's a bad name because it sounds like we're reordering 
> the projects.
> 
> 
> So, where might we find kinks in "weave" mode?
> * Correctly implementing reactor failure behavior (--fail-fast, 
> --fail-at-end, --fail-never) with blacklisting
> * What happens if we specify multiple lifecycles?  "mvn compile test"
> * What happens if we just specify the raw goals? "mvn myPlugin:goal"
> * What if we mix and match? "mvn compile myPlugin:goal test"
> * What if we put clean last?  Would we clean projects while later projects 
> depend on them? "mvn compile clean"
> * What if the reactor is building a plugin that is used later in the reactor?
> * How would users resume "weave" mode?  (Today we allow users to 
> --resume-from a particular project.)  Would "weave" users resume from a 
> particular project+phase?  Would resuming even be reasonable?  If you changed 
> a class, you'd need to recompile it and THEN retest it...
> 
> These are the sort of areas where we'd want to have a good singlethreaded 
> implementation with integration tests BEFORE plowing ahead with a 
> multithreaded implementation.
> 
> -Dan
> 
> Kristian Rosenvold wrote:
> 
>> I've looked over the code and thought a bit further about the
>> constraints involved, and given that:
>> 
>> - Multi module reactor builds are the only interesting targets of
>> multithreading.
>> - Reactor builds do not use the "install" output of their upstream
>> dependencies (I was not aware of that ;)
>> 
>> You do not have to re-order anything at all. An implementation
>> could just:
>> A) Immediately fork 1 thread per module for all modules.
>> B) For the phases compile, install and deploy, a given module can
>> only proceed when all its upstream dependencies have completed the same
>> state.
>> There's still a chance of leaking artifacts to local repository if
>> upstream deploy fails after install, and the general idea of a
>> transacted repo would still be nice to stay consistent.
>> 
>> I'm still a bit unsure about B) above, it may be a bit limiting in terms
>> of other usage scenarios. I'm also a bit unsure how that'd fit in with all
>> the other activities in the lifecycle. An alternative would be to
>> make a declarative representation of phase interdependencies that could
>> express multiple types of concurrency interdependencies. (But I
>> consistently only see one dependency type -
>> upstreamMustFinishBeforeThisCanStart...?)
>> 
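
A rough sketch of rule B), under several assumptions: the class below is hypothetical and not Maven code, each "phase" just appends to a log, failures are not handled, and a fixed thread pool replaces one thread per module. A module's gated phase is chained onto its own previous phase plus the same phase of every upstream module, so no pool thread ever blocks while waiting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical scheduler for illustration only.
class PhaseGateScheduler {
    static final List<String> PHASES = Arrays.asList("compile", "test", "install", "deploy");
    // Phases that must wait for the same phase of all upstream modules.
    static final Set<String> GATED = new HashSet<>(Arrays.asList("compile", "install", "deploy"));

    // `upstream` maps each module to its direct upstream modules. Its iteration
    // order must be reactor order, so upstream modules are registered first.
    static List<String> build(Map<String, List<String>> upstream, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        Map<String, CompletableFuture<Void>> done = new HashMap<>();
        List<CompletableFuture<Void>> lastPhases = new ArrayList<>();
        for (String module : upstream.keySet()) {
            CompletableFuture<Void> previous = CompletableFuture.completedFuture(null);
            for (String phase : PHASES) {
                List<CompletableFuture<Void>> waitFor = new ArrayList<>();
                waitFor.add(previous);
                if (GATED.contains(phase)) {
                    for (String up : upstream.get(module)) {
                        waitFor.add(done.get(up + ":" + phase));
                    }
                }
                String step = module + ":" + phase;
                // A real build would run the phase's plugins here.
                previous = CompletableFuture
                        .allOf(waitFor.toArray(new CompletableFuture<?>[0]))
                        .thenRunAsync(() -> log.add(step), pool);
                done.put(step, previous);
            }
            lastPhases.add(previous);
        }
        CompletableFuture.allOf(lastPhases.toArray(new CompletableFuture<?>[0])).join();
        pool.shutdown();
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        Map<String, List<String>> upstream = new LinkedHashMap<>();
        upstream.put("A", Collections.<String>emptyList());
        upstream.put("B", Arrays.asList("A"));
        upstream.put("C", Arrays.asList("A", "B"));
        // The exact interleaving varies from run to run.
        System.out.println(build(upstream, 3));
    }
}
```

Because "test" is not gated here, B may start testing while A is still testing, but B cannot install until A has installed.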
>> Would it float ?
>> 
>> Kristian
>> 
>> 
>> Sat, 21.11.2009 at 11:40 +0000, Stephen Connolly wrote:
>>> In m3 (which is what we are talking about) AFAIK we can have a
>>> listener that waits for the end of the start of the deploy phase
>>> and/or the end of execution.
>>> 
>>> With a customized install plugin, we could just install to the
>>> "transaction" repository.  The listener can then block until the
>>> criteria have been met (allowing other modules to progress). That would
>>> achieve what you're after... namely, produce the artifacts for
>>> consumption by the other modules before running test and
>>> integration-test. Once the criteria have been met, we either fail the
>>> module or we move the artifacts from the "transactional" local repo to
>>> the real local repo and allow the lifecycle to continue.
>>> 
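
As a sketch of the "transaction" repository idea, assuming a hypothetical StagedRepository class (no such class exists in Maven or the install plugin): installs go into a private staging directory, and only a commit moves them into the real local repository.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only.
class StagedRepository {
    private final Path realRepo;
    private final Path staging;
    private final List<Path> staged = new ArrayList<>(); // paths relative to the repo root

    StagedRepository(Path realRepo) throws IOException {
        this.realRepo = realRepo;
        this.staging = Files.createTempDirectory("staged-repo");
    }

    // "install": the artifact is written to the staging area only.
    void install(String relativePath, byte[] content) throws IOException {
        Path target = staging.resolve(relativePath);
        Files.createDirectories(target.getParent());
        Files.write(target, content);
        staged.add(Paths.get(relativePath));
    }

    // Criteria met: move every staged artifact into the real local repo.
    // Each file is moved on its own, so the commit as a whole is not atomic.
    void commit() throws IOException {
        for (Path relative : staged) {
            Path target = realRepo.resolve(relative);
            Files.createDirectories(target.getParent());
            Files.move(staging.resolve(relative), target, StandardCopyOption.REPLACE_EXISTING);
        }
        staged.clear();
    }

    // Criteria failed: discard staged artifacts; the real repo is untouched.
    // Empty staging directories are left behind in this sketch.
    void rollback() throws IOException {
        for (Path relative : staged) {
            Files.deleteIfExists(staging.resolve(relative));
        }
        staged.clear();
    }

    public static void main(String[] args) throws IOException {
        Path repo = Files.createTempDirectory("real-repo");
        StagedRepository tx = new StagedRepository(repo);
        tx.install("com/example/a/1.0/a-1.0.jar", new byte[] {1, 2, 3});
        System.out.println(Files.exists(repo.resolve("com/example/a/1.0/a-1.0.jar"))); // false
        tx.commit();
        System.out.println(Files.exists(repo.resolve("com/example/a/1.0/a-1.0.jar"))); // true
    }
}
```

Downstream modules in the same build would have to resolve against the staging directory first, which this sketch leaves out.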
>>> -Stephen
>>> 
>>> 2009/11/21 Kristian Rosenvold <[email protected]>:
>>>> I seem to understand that there's room for several different
>>>> types of solution here;
>>>> 
>>>> Starting with the single-machine solution; I now understand that
>>>> you could start forking downstream builds straight after
>>>> compile in a reactor build, maybe after install in other cases.
>>>> 
>>>> In this scenario I think each module is dependent on all upstream
>>>> modules successfully achieving "install" before proceeding to "deploy".
>>>> I really think it's important to avoid leaking artifacts that do not
>>>> have their own (and all upstream) lifecycle requirements fulfilled.
>>>> 
>>>> When it comes to clustering there may be several approaches:
>>>> If you decide to publish artifacts through "deploy" to any kind
>>>> of repo I believe these need to have all lifecycle requirements met,
>>>> which in my current understanding seems orthogonal to local out-of-order
>>>> execution.
>>>> 
>>>> Wouldn't it be feasible to distribute the "local" and perhaps
>>>> "transacted local" repo inside the cluster using network
>>>> file sharing ? One would still have to solve serialization issues
>>>> and using installed artifacts in a reactor build..?
>>>> 
>>>> The clustering case seems like a much harder task than achieving
>>>> full local concurrency. I did some fairly extensive measurements
>>>> with my current build when I set up concurrent spring/junit testing:
>>>> 
>>>> Missing concurrency in classloading is the most important reason
>>>> why unit tests run slowly (classloading is strictly a synchronized
>>>> business until jdk7). By running tests out of order on my local
>>>> unit test-build I am fairly certain I could reduce run-time
>>>> for "mvn clean install" to something much closer to "mvn
>>>> -Dmaven.test.skip=true clean install" (80->25 seconds in my case).
>>>> This is even before I start parallelizing the individual modules.
>>>> 
>>>> I must confess that I've yet to see a build that really needs
>>>> clustering for any other reason than running tests or other individual
>>>> tasks (javadoc, site etc). I think I'd be inclined to just distribute
>>>> those specific tasks in a cluster. If you actually had a decent model of
>>>> inter-lifecycle phase dependencies (requiredForStarting between phases),
>>>> you could probably achieve good results by keeping lifecycle execution
>>>> centralized but distributing plugin execution ?
>>>> 
>>>> I suppose I may be narrow-minded on this last one...
>>>> 
>>>> I will be starting to look at the DefaultLifeCycleExecutor with thoughts
>>>> of out-of-order execution, maybe dabble around a little.
>>>> 
>>>> Kristian
>>>> 
>>>> Fri, 20.11.2009 at 06:29 -0800, Dan Fabulich wrote:
>>>>> I've been meaning to reply to your earlier emails (it's been a busy week);
>>>>> to this I'll just say that moving the "test" phase after the "install"
>>>>> phase is a fascinating idea, which I personally like, but it seems like a
>>>>> big violation of the contract for the lifecycle, and I suspect it won't be
>>>>> popular. :-(
>>>>> 
>>>>> I've long felt that there should be a phase for testing after "install"
>>>>> for similar reasons.  This might be SLIGHTLY more popular since users
>>>>> would need to explicitly cause their tests to run during this phase.
>>>>> 
>>>>> What about users doing multi-machine builds?  Earlier this week I wrote
>>>>> that users desiring to do multi-machine parallelism should deploy their
>>>>> builds to a remote repository shared between the machines.  Should their
>>>>> tests run post-deploy?
>>>>> 
>>>>> -Dan
>>>>> 
>>>>> 
>>>>> Kristian Rosenvold wrote:
>>>>> 
>>>>>> I've been thinking further about parallelity within maven. The proposed
>>>>>> solution to MNG-3004 achieves parallelity by analyzing inter-module
>>>>>> dependencies and scheduling parallel dependencies in parallel.
>>>>>> 
>>>>>> A simple further evolution of this would be to collect and download all
>>>>>> external dependencies for all modules immediately.
>>>>>> 
>>>>>> But this idea has been rummaging in my head while jogging for a week or
>>>>>> so:
>>>>>> 
>>>>>> Would it be possible to achieve super-parallelity by describing
>>>>>> relationships between phases of the build, and even reordering some of
>>>>>> the phases ? I'll try to explain:
>>>>>> 
>>>>>> Assume that you can add transactional ACID (or maybe just AID) abilities
>>>>>> towards the local repo for a full build. Simply put: All writes to a
>>>>>> local repo are done in a per-process-specific instance of the repo, that
>>>>>> can be rolled back if the build fails (or pushed to the local repo if
>>>>>> the build is ok).
>>>>>> 
>>>>>> If you do that you can re-order the life-cycle for most builds to be
>>>>>> something like this:
>>>>>> 
>>>>>> validate
>>>>>> compile
>>>>>> package
>>>>>> install
>>>>>> test
>>>>>> integration-test
>>>>>> deploy
>>>>>> 
>>>>>> Notice that I just moved all the "test" phases after the "install" phase.
>>>>>> Theoretically you could start any subsequent modules immediately after
>>>>>> "install" is done. Running of tests is really the big killer in most
>>>>>> multi-module projects I see.
>>>>>> 
>>>>>> Since your commit "push" towards the local repo only happens at the very
>>>>>> end of the build, you will not publish artifacts when tests are failing
>>>>>> (at least not project output artifacts).
>>>>>> 
>>>>>> You could actually make this a generic model that describes different
>>>>>> kinds of dependencies between lifecycle phases of different modules. The
>>>>>> dependency I immediately see is "requiredForStarting" - which could be
>>>>>> interpreted as meaning that any upstream dependencies must have reached
>>>>>> at least that phase before the phase can be started for this project.
>>>>>> I'm not sure if there's any value in a generic model, but my perspective
>>>>>> may be limited to what I see on a daily basis.
>>>>>> 
>>>>>> Would this be feasible ?
>>>>>> 
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>> 