Hi 黄若尘, Armin,


> > I thought about that too, but the granularity is very wrong for STM:
> > the overhead of running tiny transactions will completely dwarf any
> > potential speed gains.

Then maybe we should go with a slightly different flavour of STM, specific to 
assembler loops.

A loop in assembly language operates on a linear memory model and usually 
modifies only a small number of memory cells. In fact, most changes made by 
one loop iteration get overwritten in the next. Assuming that a single 
assembler instruction modifies at most one memory cell, the number of cells 
changed is no more than the number of loop iterations.

We could replace (para-virtualize) every instruction that writes to a memory 
cell with two stack pushes: one saving the memory address, the other the 
value being written. Every memory read would likewise be replaced with a 
search of that stack before falling back to the actual memory contents. This 
is a big penalty, of course, but it's worth checking whether it pays off.

This is a simple, assembler-specific flavor of STM.
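
To make this concrete, here is a minimal sketch in C of what the 
para-virtualized store and load could look like. The write_log_t structure 
and the function names are my invention for illustration, not anything that 
exists in PyPy:

#include <stddef.h>
#include <stdint.h>

/* Per-core write log: every store of the speculative loop body goes
   here instead of touching real memory.  No overflow check in this
   sketch. */
typedef struct {
    uintptr_t addr[1024];
    uint64_t  value[1024];
    size_t    top;
} write_log_t;

/* Replaces a store instruction: push (address, value) onto the log. */
static void logged_store(write_log_t *log, uintptr_t addr, uint64_t value)
{
    log->addr[log->top]  = addr;
    log->value[log->top] = value;
    log->top++;
}

/* Replaces a load instruction: search the log newest-to-oldest,
   falling back to the actual memory contents on a miss. */
static uint64_t logged_load(write_log_t *log, uintptr_t addr)
{
    for (size_t i = log->top; i > 0; i--)
        if (log->addr[i - 1] == addr)
            return log->value[i - 1];
    return *(const uint64_t *)addr;     /* fall back to real memory */
}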

Then we could employ loop scheduling and run the modified code on all cores. 
Afterwards we would check whether all the memory modifications agree, i.e. 
whether no two cores tried to write different values to the same memory 
address. If there is no conflict, we commit the transaction and exit the loop.
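
The commit step could then look roughly like this, building on the invented 
write_log_t sketch above:

#include <stdbool.h>

/* Conflict check: no two cores may have written *different* values to
   the same address.  Quadratic for brevity; a real implementation
   would sort or hash the logs. */
static bool logs_agree(const write_log_t *logs, int ncores)
{
    for (int a = 0; a < ncores; a++)
        for (int b = a + 1; b < ncores; b++)
            for (size_t i = 0; i < logs[a].top; i++)
                for (size_t j = 0; j < logs[b].top; j++)
                    if (logs[a].addr[i] == logs[b].addr[j] &&
                        logs[a].value[i] != logs[b].value[j])
                        return false;   /* write-write conflict */
    return true;
}

/* Commit: replay every log into real memory. */
static void commit_logs(const write_log_t *logs, int ncores)
{
    for (int c = 0; c < ncores; c++)
        for (size_t i = 0; i < logs[c].top; i++)
            *(uint64_t *)logs[c].addr[i] = logs[c].value[i];
}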

The kinds of loops that would benefit most from such an optimization are 
memset, memcpy and all map-like constructs:

dst = map(fun, src)

--->

for (int i = 0; i < len(src); i++)
    dst[i] = fun(src[i]);
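
With the logged stores from the sketch above, the loop scheduling for such a 
map could be a simple static split of the iteration space (again only an 
illustration):

/* Core c of ncores runs iterations [c*len/ncores, (c+1)*len/ncores),
   writing through its own log instead of directly into dst. */
static void map_chunk(int c, int ncores, size_t len,
                      uint64_t *dst, const uint64_t *src,
                      uint64_t (*fun)(uint64_t), write_log_t *log)
{
    size_t lo = (size_t)c * len / ncores;
    size_t hi = (size_t)(c + 1) * len / ncores;
    for (size_t i = lo; i < hi; i++)
        logged_store(log, (uintptr_t)&dst[i], fun(src[i]));
}

For a true map the chunks never overlap, so the conflict check trivially 
succeeds and the commit is always safe.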


> can we make a hybrid system that first screens out loops that are not 
> suitable for parallelization and then runs the others with STM?

Following the general "try and fail" philosophy of Python, I would suggest 
the following: just run the unmodified loop on one core and use the other 
cores to optimize and execute the modified version. If the optimization turns 
out to be unsuitable, or the serial execution finishes first, just abort the 
optimized run. If the loop turns out to be parallelizable, return its results 
instead.
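
A sketch of that race, using pthreads and an atomic flag. The speculative 
runner below is only a stand-in that produces the result directly; in reality 
it would execute the instrumented loop and check for conflicts. Everything 
here is hypothetical:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 100000000LL

/* 0 = undecided, 1 = serial won, 2 = speculative won.  The loser's
   result is discarded; a real system would also actively abort the
   losing run instead of letting it finish. */
static atomic_int winner;

static void *serial_run(void *arg)      /* the unmodified loop, one core */
{
    long long s = 0;
    for (long long i = 0; i < N; i++)
        s += i;
    int expected = 0;
    if (atomic_compare_exchange_strong(&winner, &expected, 1))
        *(long long *)arg = s;          /* serial version finished first */
    return NULL;
}

static void *speculative_run(void *arg) /* stand-in for the modified loop */
{
    long long s = N * (N - 1) / 2;      /* the "parallelized" result */
    int expected = 0;
    if (atomic_compare_exchange_strong(&winner, &expected, 2))
        *(long long *)arg = s;          /* speculation finished first */
    return NULL;
}

int main(void)
{
    long long sum = 0;
    pthread_t a, b;
    pthread_create(&a, NULL, serial_run, &sum);
    pthread_create(&b, NULL, speculative_run, &sum);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("winner=%d sum=%lld\n", atomic_load(&winner), sum);
    return 0;
}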

Thanks
haael



Od: "黄若塵" <[email protected]>
Do: "Armin Rigo" <[email protected]>; [email protected]; 
Wysłane: 9:07 Środa 2014-11-26
Temat: Re: [pypy-dev] An idea about automatic parallelization in PyPy/RPython

> Hi Haael, Rigo,
> On 2014/11/21 19:21, Armin Rigo wrote:
> > 
> > Hi Haael, hi 黄若尘,
> > 
> > On 21 November 2014 10:55, haael wrote:
> >> I would suggest a different approach, more similar to Armin's idea of 
> >> parallelization.
> >> 
> >> You could just optimistically assume that the loop is parallelizable. Just 
> >> execute few steps at once (each in its own memory sandbox) and check for 
> >> conflicts later. This also plays nice with STM.
> > 
> > I thought about that too, but the granularity is very wrong for STM:
> > the overhead of running tiny transactions will completely dwarf any
> > potential speed gains.  If we're talking about tiny transactions then
> > maybe HTM would be more suitable.  I have no idea if HTM will ever
> > start appearing on GPU, though.  Moreover, you still have the general
> > hard problems of automatic parallelization, like communicating between
> > threads the progress made; unless it is carefully done on a
> > case-by-case basis by a human, this often adds (again) considerable
> > overheads.
> 
> Well, recently I have read some papers about TLS, and I have also realized 
> the heavy performance penalty of STM. What I am considering is: is it 
> possible to simplify an STM for the traces generated by RPython, using some 
> of their features (for example, there is no control flow, only guards; 
> there are some jit.elidable functions in the interpreter)? Or can we make a 
> hybrid system that first screens out loops that are not suitable for 
> parallelization and then runs the others with STM?
> 
> > 
> > To 黄若尘: here's a quick answer to your question.  It's not very clean,
> > but I would patch rpython/jit/backend/x86/regalloc.py, prepare_loop(),
> > just after it calls _prepare().  It gets a list of rewritten
> > operations ready to be turned into assembler.  I guess you'd need to
> > check at this point if the loop contains only operations you support,
> > and if so, produce some different code (possibly GPU).  Then either
> > abort the job here by raising some exception, or if it makes sense,
> > change the 'operations' list so that it becomes just a few assembler
> > instructions that will start and stop the GPU code.
> > 
> > My own two cents about this project, however, is that it's relatively
> > easy to support a few special cases, but it quickly becomes very, very
> > hard to support more general code.  You are likely to end up with a
> > system that only compiles to GPU some very specific templates and
> > nothing else.  The end result for a user is obscure, because he won't
> > get to use the GPU unless he writes loops that follow exactly some
> > very strict rules.  I certainly see why the end user might prefer to
> > use a DSL instead: i.e. he knows he wants to use the GPU at specific
> > places, and he is ready to use a separate very restricted "language"
> > to express what he wants to do, as long as it is guaranteed to use the
> > GPU.  (The needs in this case are very different from the general PyPy
> > JIT, which tries to accelerate any Python code.)
> > 
> > 
> > A bientôt,
> > 
> > Armin.
> 
> 
