Hello,
In our present system of equations, we sweep 8 loosely coupled PDEs defined on
5 meshes until their residuals fall below an acceptable value.
Profiling the code makes it clear that the sweep function is a major
computational bottleneck. We tried MPI parallelism, but that didn't work out.
To side-step this, and recognising that the sweep() calls are executed one by
one in serial, we propose an interesting work-around based on an alternative
style of parallelism.
Instead of dividing the control volumes among multiple CPUs, we propose to
distribute the 8 different sweep calls (in the sweep loop within a time step)
across worker cores. Each sweep can then proceed independently, instead of the
Python interpreter waiting for one sweep to finish before beginning another. I
suspect this would be beneficial for problems with 6-8 CPU-intensive sweep
calls, where the cost of the communication overhead between the master process
and the child workers is offset by the benefit gained from independent
sweeping. A rough sketch of the idea follows.
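To be clear, this sketch is not our actual implementation; the Pool-based
dispatch and the sweep_one helper are just illustrative assumptions, and it
presumes the FiPy equation and variable objects can be shipped to the workers,
as the Queue-based demonstrator further below suggests:

from multiprocessing import Pool

def sweep_one(task):
    # hypothetical helper: each task bundles an equation, its variable and
    # the time step; the worker sweeps once and hands back the residual
    # together with the updated variable
    eqn, var, dt = task
    res = eqn.sweep(var=var, dt=dt)
    return res, var

# one task per PDE, so all sweeps of one pass of the sweep loop run concurrently:
# pool = Pool(processes=8)
# results = pool.map(sweep_one, tasks)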
From my fairly modest numerical-computing knowledge, this looks analogous to a
Jacobi-style implementation rather than a Gauss-Seidel-style iteration, in
which the updated cell variable from a sweep is immediately available to all
subsequent PDEs of the coupled system within the sweep loop. In the
multi-processor sweep, by contrast, the values are updated using values from
the last sweep iteration. Thus, this might be slower (but more stable) to
converge, wouldn't it? There is also a penalty incurred by the communication
overhead of opening and killing processes. (A toy illustration of the
Jacobi/Gauss-Seidel difference is sketched just below.)
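To make the ordering difference concrete, here is a toy (non-FiPy) illustration
of the two update styles on a small 2x2 linear system; it only shows which
values each update gets to see:

import numpy as np

# iteratively solve A x = b; the system itself is arbitrary
A = np.array([[4., 1.],
              [2., 5.]])
b = np.array([1., 2.])

def gauss_seidel_step(x):
    # serial sweeping: the second update already sees the freshly computed x[0]
    x = x.copy()
    x[0] = (b[0] - A[0, 1] * x[1]) / A[0, 0]
    x[1] = (b[1] - A[1, 0] * x[0]) / A[1, 1]
    return x

def jacobi_step(x):
    # independent (parallelisable) sweeping: both updates use the old snapshot,
    # so each other's new values are only seen on the next pass
    return np.array([(b[0] - A[0, 1] * x[1]) / A[0, 0],
                     (b[1] - A[1, 0] * x[0]) / A[1, 1]])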
The questions are: 1) Has anybody tried this before? 2) Does this sound
problematic from a technical or practical perspective? 3) Does this sound
remotely useful?
I have successfully implemented a multiprocessing parallelisation for
examples.diffusion.coupled (shown below). The simulation results (for v0 and
v1) after 100 time steps are very close to those of the serial sweep. Although
this code is several times slower than the serial code, we tried it more as a
concept demonstrator before committing significant effort to converting our
serial code to a multiprocessing approach. If you see a red flag, we would much
appreciate it if you could point it out.
from fipy import Grid1D, CellVariable, TransientTerm, DiffusionTerm, Viewer
from multiprocessing import Process, Queue, cpu_count

m = Grid1D(nx=100, Lx=1.)
v0 = CellVariable(mesh=m, hasOld=True, value=0.5)
v1 = CellVariable(mesh=m, hasOld=True, value=0.5)

def boundaryConditions(m, v0, v1):
    v0.constrain(0, m.facesLeft)
    v0.constrain(1, m.facesRight)
    v1.constrain(1, m.facesLeft)
    v1.constrain(0, m.facesRight)

## Multiprocessor implementation
#
def get_res0(res0_queue, v0, v1):
    # sweep the first equation in a worker process and return
    # (residual, updated variable) through the queue
    boundaryConditions(m, v0, v1)
    eq0 = TransientTerm() == DiffusionTerm(coeff=0.01) - v1.faceGrad.divergence
    res0 = eq0.sweep(var=v0, dt=1e-5)
    res0_queue.put([res0, v0])

def get_res1(res1_queue, v0, v1):
    # sweep the second equation in a worker process
    boundaryConditions(m, v0, v1)
    eq1 = TransientTerm() == v0.faceGrad.divergence + DiffusionTerm(coeff=0.01)
    res1 = eq1.sweep(var=v1, dt=1e-5)
    res1_queue.put([res1, v1])

if __name__ == '__main__':
    print "You have " + str(cpu_count()) + " CPU cores in your machine"
    print "Simulation end time is 99 seconds.\n"
    res0_queue = Queue()
    res1_queue = Queue()
    for t in range(100):
        print "Computing the solution at time, t = " + str(t) + " sec"
        v0.updateOld()
        v1.updateOld()
        res0 = res1 = 1e6
        while max(res0, res1) > 0.01:
            # launch one worker per PDE; each sweeps independently
            process_a = Process(target=get_res0, args=(res0_queue, v0, v1))
            process_b = Process(target=get_res1, args=(res1_queue, v0, v1))
            process_a.start()
            process_b.start()
            process_a.join()
            process_b.join()
            # pull the residuals and swept variables back into the parent
            while not res0_queue.empty():
                res0, v0 = res0_queue.get()
            while not res1_queue.empty():
                res1, v1 = res1_queue.get()
            print v0, v1
            print res0, res1