Re: [pypy-dev] Interpreter level array implementation

2010-07-04 Thread Paolo Giarrusso
Hi Carl,
first, thanks for reading and for your explanation.

On Sat, Jul 3, 2010 at 10:03, Carl Friedrich Bolz cfb...@gmx.de wrote:
 2010/7/3 Paolo Giarrusso p.giarru...@gmail.com:
 It requires thinking. It's harder to do because we don't know
 statically upfront how many paths we'll compile to assembler, but I
 can think about ways to mitigate that.

 Isn't there some existing research about that in the 'tracing'
 community? As far as I remember, the theory is that traces are
 assembled into trace trees, and that each time a new trace is added a
 (simplified*) SSA optimization pass is applied to the whole tree to
 recompile it. I'm not sure whether they do this for JavaScript too,
 since compilation times have to be very fast there, but I guess they
 did so in their Java compiler.

 There are two ways to deal with attaching new traces to existing ones.
 On the one hand there are trace trees, which recompile the whole tree
 of traces when a new one is added. This can be costly. On the other
 hand, there is trace stitching, which just patches the existing trace
 to jump to the new one. PyPy (and TraceMonkey, I think) uses trace
 stitching.

For TraceMonkey, response times suggest the use of trace stitching.
The original tracing Java compiler used trace trees. But if I have a
Python application server, I'm probably willing to accept the bigger
compilation time, especially if compilation is performed by a
background thread. Would it be possible to accommodate this case?
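
To make the contrast concrete, here is a minimal toy sketch of the two
strategies in Python. Everything below is invented for illustration
(the class names, the list-of-strings "compilation") and has nothing to
do with how PyPy's JIT is actually structured:

    class TraceTreeJIT:
        """Trace trees: adding a side trace recompiles the whole tree, so a
        global optimization pass sees every path at once -- but it is costly."""
        def __init__(self, root_trace):
            self.paths = [list(root_trace)]
            self.code = self._compile_all()

        def add_side_trace(self, trace):
            self.paths.append(list(trace))
            self.code = self._compile_all()   # recompile everything

        def _compile_all(self):
            # stand-in for re-running the SSA optimizations over the full tree
            return [op for path in self.paths for op in path]


    class TraceStitchingJIT:
        """Trace stitching: compiled code is left alone; the failing guard is
        simply patched to jump to the newly compiled bridge."""
        def __init__(self, root_trace):
            self.code = list(root_trace)      # compiled once, never touched again
            self.bridges = {}                 # guard name -> compiled bridge

        def add_bridge(self, guard_name, trace):
            self.bridges[guard_name] = list(trace)   # only the new path is compiled


    loop = ["int_lt", "guard_true(G1)", "getarrayitem_gc", "float_add", "jump"]
    side = ["call_something_slow", "jump"]

    tree = TraceTreeJIT(loop)
    tree.add_side_trace(side)                 # whole tree recompiled here

    stitched = TraceStitchingJIT(loop)
    stitched.add_bridge("G1", side)           # existing code untouched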

 The problem with loop-invariant code motion is that when you stitch in
 a new trace (what we call a bridge) it is not clear that the code that
 was invariant so far is invariant on the new path as well.
I see - but what about noting potential modifications to the involved
objects and invalidating the old traces, similarly to how class loading
invalidates other optimizations?
Of course, some heuristics and tuning would be needed, I guess, since I
expect that invalidations would otherwise be much more frequent here
than in the class-loading case.
Such heuristics would probably approximate a solution to the problem
mentioned by Maciej:
 It requires thinking. It's harder to do because we don't know
 statically upfront how many paths we'll compile to assembler, but I
 can think about ways to mitigate that.

However, I still wonder how easy it is to recognize a potential write.
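
A rough sketch of what such a scheme could look like, with hypothetical
names and structure (nothing below reflects how PyPy actually tracks
dependencies): traces that rely on a field being invariant register
themselves, and any write routed through the map throws the dependent
traces away.

    from collections import defaultdict

    class InvariantMap:
        def __init__(self):
            # (object id, field name) -> set of traces that hoisted a read of it
            self.deps = defaultdict(set)
            self.valid_traces = set()

        def register(self, trace, obj, field):
            self.deps[(id(obj), field)].add(trace)
            self.valid_traces.add(trace)

        def note_write(self, obj, field):
            # the hard part: every potential write must be detected and routed here
            for trace in self.deps.pop((id(obj), field), ()):
                self.valid_traces.discard(trace)   # invalidate; re-trace later


    class Img(object):
        """Stand-in for the array object whose 'data' field got hoisted."""
        def __init__(self):
            self.data = [0.0] * 10

    inv = InvariantMap()
    img = Img()
    inv.register("loop #1", img, "data")
    inv.note_write(img, "data")        # e.g. img.data gets rebound somewhere
    assert "loop #1" not in inv.valid_traces

The open question is exactly the note_write part: every store that
could touch a hoisted field has to be recognized, which is why the
invalidations could become much more frequent than in the class-loading
case.
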
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev


Re: [pypy-dev] Interpreter level array implementation

2010-07-03 Thread Hakan Ardo
On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski fij...@gmail.com wrote:
 General note - we consider 2x optimized C a pretty good result :) Details 
 below

As do I :) I just want to make this as jit-friendly as possible
without really knowing what's jit-friendly...

 Yes. We don't do loop-invariant optimizations, for a few reasons, the
 best of them being the fact that you can always add a bridge to a loop
 which will invalidate the invariant.

Are you telling me that you will probably never include that kind of
optimization because of the limitations it imposes on other parts of
the jit, or just that it would be a lot of work to get it in place?

What is a bridge?

-- 
Håkan Ardö
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev


Re: [pypy-dev] Interpreter level array implementation

2010-07-03 Thread Maciej Fijalkowski
On Sat, Jul 3, 2010 at 12:14 AM, Hakan Ardo ha...@debian.org wrote:
 On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski fij...@gmail.com wrote:
 General note - we consider 2x optimized C a pretty good result :) Details 
 below

 As do I :) I just want to make this as jit-friendly as possible
 without really knowing what's jit-friendly...

I think it's fairly JIT-friendly. You can look into the traces (as you
did), but it seems fine to me.


 Yes. We don't do loop-invariant optimizations, for a few reasons, the
 best of them being the fact that you can always add a bridge to a loop
 which will invalidate the invariant.

 Are you telling me that you will probably never include that kind of
 optimization because of the limitations it imposes on other parts of
 the jit, or just that it would be a lot of work to get it in place?

It requires thinking. It's harder to do because we don't know
statically upfront how many paths we'll compile to assembler, but I
can think about ways to mitigate that.


 What is a bridge?

If a guard fails often enough, the path starting from it is traced and
compiled to assembler. That's a bridge.
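
As a toy model of that mechanism (the threshold value and all names
below are made up for illustration; in PyPy the threshold is a JIT
parameter and the real machinery is more involved):

    FAILURE_THRESHOLD = 200        # made-up value

    class Guard(object):
        def __init__(self, name):
            self.name = name
            self.failures = 0
            self.bridge = None     # compiled code for the path where the guard fails

        def on_failure(self, trace_and_compile):
            self.failures += 1
            if self.bridge is None and self.failures >= FAILURE_THRESHOLD:
                # trace what the interpreter does after this guard fails,
                # compile it, and patch the guard to jump straight to it next time
                self.bridge = trace_and_compile(self.name)
            return self.bridge

    guard2 = Guard("Guard2")
    for _ in range(FAILURE_THRESHOLD):
        bridge = guard2.on_failure(lambda name: "machine code for bridge of " + name)
    print(bridge)                  # present once the guard has failed often enough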


 --
 Håkan Ardö

___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev

Re: [pypy-dev] Interpreter level array implementation

2010-07-03 Thread Paolo Giarrusso
On Sat, Jul 3, 2010 at 08:20, Maciej Fijalkowski fij...@gmail.com wrote:
 On Sat, Jul 3, 2010 at 12:14 AM, Hakan Ardo ha...@debian.org wrote:
 On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski fij...@gmail.com wrote:
 General note - we consider 2x optimized C a pretty good result :) Details 
 below

 As do I :) I just want to make this as jit-friendly as possible
 without really knowing what's jit-friendly...

 I think it's fairly JIT-friendly. You can look into the traces (as you
 did), but it seems fine to me.

 Yes. We don't do loop-invariant optimizations, for a few reasons, the
 best of them being the fact that you can always add a bridge to a loop
 which will invalidate the invariant.

 Are you telling me that you will probably never include that kind of
 optimization because of the limitations it imposes on other parts of
 the jit, or just that it would be a lot of work to get it in place?

 It requires thinking. It's harder to do because we don't know
 statically upfront how many paths we'll compile to assembler, but I
 can think about ways to mitigate that.

Isn't there some existing research about that in the 'tracing'
community? As far as I remember, the theory is that traces are
assembled into trace trees, and that each time a new trace is added a
(simplified*) SSA optimization pass is applied to the whole tree to
recompile it. I'm not sure whether they do this for JavaScript too,
since compilation times have to be very fast there, but I guess they
did so in their Java compiler.

Also, in other cases the general JIT approach is 'optimize and
invalidate if needed'. For instance, if a Java class currently has no
subclass, it is not safe to assume that this will hold forever; the
optimization is performed anyway, and a hook is installed so that
class loading will undo it.

Another issue: what is i4 for? It's not used at all in the loop, but
it is reset to 27 at the end of it, each time. Doesn't such a var
waste some (little) time?

* SSA on trace trees took advantage of their simpler structure
compared to graphs for some operations.
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev

Re: [pypy-dev] Interpreter level array implementation

2010-07-03 Thread Armin Rigo
Hi Alex,

On Fri, Jul 02, 2010 at 03:12:19PM -0500, Alex Gaynor wrote:
 In addition to the things you noted, I guess the int overflow check
 can be optimized out, since i+=1 can never cause it to overflow given
 that i is bounded at 640*480.  I suppose in general that would require
 more dataflow analysis.

Hakan mentioned this.  It's actually an easy optimization in our linear
code; I guess I will give it a try.
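
A rough sketch of what such a pass over the linear trace could do (a
toy illustration with invented op tuples, not PyPy's optimizer):
remember the upper bound proven by an earlier guard, and drop the
overflow guard when the bound makes overflow impossible.

    import sys

    def drop_redundant_overflow_guards(trace):
        """trace: list of (result, opname, args) in linear order."""
        upper = {}       # int variable -> proven exclusive upper bound
        pending = {}     # bool variable -> (int variable, bound) proven if guard passes
        out = []
        for result, op, args in trace:
            if op == "int_lt" and isinstance(args[1], int):
                pending[result] = (args[0], args[1])
            elif op == "guard_true" and args[0] in pending:
                var, bound = pending[args[0]]
                upper[var] = bound                       # e.g. i < 307200 from lines 2/3
            elif op == "int_add_ovf" and args[0] in upper and isinstance(args[1], int):
                if upper[args[0]] - 1 + args[1] <= sys.maxsize:
                    op = "int_add"                       # provably cannot overflow
                    upper[result] = upper[args[0]] + args[1]
            elif op == "guard_no_overflow" and out and out[-1][1] == "int_add":
                continue                                 # the guard is now dead
            out.append((result, op, args))
        return out

    loop = [
        ("i14", "int_lt",            ("i", 307200)),
        (None,  "guard_true",        ("i14",)),
        ("i20", "int_add_ovf",       ("i", 1)),
        (None,  "guard_no_overflow", ()),
    ]
    print(drop_redundant_overflow_guards(loop))
    # the int_add_ovf/guard_no_overflow pair comes out as a plain int_add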


A bientot,

Armin.
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev


Re: [pypy-dev] Interpreter level array implementation

2010-07-03 Thread Armin Rigo
Hi Paolo,

On Sat, Jul 03, 2010 at 08:58:34AM +0200, Paolo Giarrusso wrote:
 Isn't there some existing research about that in the 'tracing'
 community?  (...)   I'm not sure
 whether they do this for JavaScript too, since compilation times
 have to be very fast there, but I guess they did so in their Java compiler.

We are not very good at mentioning existing research, but at least for
the case of tracing JITs I think we know pretty much everything
published, which you might find by googling for "tracing JIT".  (It's
always a better approach than making guesses on an unrelated project's
mailing list.)

For how PyPy's tracing JIT compares to existing approaches, there is a
PyPy paper at:

http://codespeak.net/svn/pypy/extradoc/talk/icooolps2009/

As well as the start of a draft about virtuals at:

http://codespeak.net/svn/pypy/extradoc/talk/s3-2010/

And you should not miss Benjamin's great summary at:

http://codespeak.net/pypy/dist/pypy/doc/jit/pyjitpl5.html


A bientot,

Armin.
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev


Re: [pypy-dev] Interpreter level array implementation

2010-07-03 Thread Antonio Cuni
On 03/07/10 08:14, Hakan Ardo wrote:

 What is a bridge?

you might be interested in reading the chapter of my PhD thesis which explains 
exactly that, with diagrams:
http://codespeak.net/svn/user/antocuni/phd/thesis/thesis.pdf

In particular, section 6.4 explains the difference between loops, bridges and 
entry bridges.

ciao,
Anto
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev


Re: [pypy-dev] Interpreter level array implementation

2010-07-03 Thread Leonardo Santagada

On Jul 3, 2010, at 3:58 AM, Paolo Giarrusso wrote:

 Another issue: what is i4 for? It's not used at all in the loop, but
 it is reset to 27 at the end of it, each time. Doesn't such a var
 waste some (little) time?

This I found interesting. Does anyone know the answer?

--
Leonardo Santagada
santagada at gmail.com



___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev


Re: [pypy-dev] Interpreter level array implementation

2010-07-03 Thread Paolo Giarrusso
On Sat, Jul 3, 2010 at 09:28, Armin Rigo ar...@tunes.org wrote:
 Hi Paolo,

 On Sat, Jul 03, 2010 at 08:58:34AM +0200, Paolo Giarrusso wrote:
 Isn't there some existing research about that in the 'tracing'
 community?  (...)   I'm not sure
 whether they do this for JavaScript too, since compilation times
 have to be very fast there, but I guess they did so in their Java compiler.

 We are not very good at mentioning existing research, but at least for
 the case of tracing JITs I think we know pretty much everything
 published, which you might find by googling for tracing JIT.  (It's
 always a better approach than doing guesses in an unrelated project's
 mailing list.)

If you had read the next sentence you'd have found out that I did read
some papers about that (where I learned about trace trees). My guess
was just about whether their Java compiler used trace trees or the
other possibility, i.e., trace stitching (as I now learned).

But thanks for the references, I'll have a look later.
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev

[pypy-dev] Interpreter level array implementation

2010-07-02 Thread Hakan Ardo
Hi,
we got the simplest possible interpreter level implementation of an
array-like object running (in the interplevel-array branch) and it
executes my previous example about 2 times slower than optimized C.
Attached is the trace generated by the following example:

img=array(640*480);   l=0;   i=0;
while i < 640*480:
    l+=img[i]
    i+=1

a simplified version of that trace is:

   1. [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i]
   2. i14 = int_lt(i, 307200)
   3.   guard_true(i14, descr=Guard1)
   4.   guard_nonnull_class(p10, 145745952, descr=Guard2)
   5. img = getfield_gc(p10, descr=GcPtrFieldDescr 8)
   6. f17 = getarrayitem_gc(img, i, descr=FloatArrayDescr)
   7. f18 = float_add(f11, f17)
   8. i20 = int_add_ovf(i, 1)
   9.   guard_no_overflow(, descr=Guard3) #
  10. i23 = getfield_raw(149604768, descr=SignedFieldDescr 0)
  11. i25 = int_add(i23, 1)
  12. setfield_raw(149604768, i25, descr=SignedFieldDescr 0)
  13. i28 = int_and(i25, -2131755008)
  14. i29 = int_is_true(i28)
  15.   guard_false(i29, descr=Guard4)
  16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32),
   ConstPtr(ptr33), p8, p9, p10, f18, i20)

Do these operations more or less correspond to assembler
instructions? I guess that the extra overhead here as compared to
the C version would be lines 4, 5, 9 and 10-15. What are lines 10-15
all about? I guess that most of these additional operations would not
affect the performance of more complicated loops, as they will only
occur once per loop (although combining the guard on line 9 with line 3
might be a possible optimization)? Line 4 will appear once for each
array used in the loop and line 5 once for every array access, right?

Can the array implementation be designed in some way that would not
generate line 5 above? Or would it be possible to get rid of it by
some optimization?

-- 
Håkan Ardö


___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev

Re: [pypy-dev] Interpreter level array implementation

2010-07-02 Thread Alex Gaynor
On Fri, Jul 2, 2010 at 2:59 PM, Hakan Ardo ha...@debian.org wrote:
 Hi,
 we got the simplest possible interpreter level implementation of an
 array-like object running (in the interplevel-array branch) and it
 executes my previous example about 2 times slower than optimized C.
 Attached is the trace generated by the following example:

    img=array(640*480);   l=0;   i=0;
    while i < 640*480:
        l+=img[i]
        i+=1

 a simplified version of that trace is:

   1. [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i]
   2. i14 = int_lt(i, 307200)
   3.   guard_true(i14, descr=Guard1)
   4.   guard_nonnull_class(p10, 145745952, descr=Guard2)
   5. img = getfield_gc(p10, descr=GcPtrFieldDescr 8)
   6. f17 = getarrayitem_gc(img, i, descr=FloatArrayDescr)
   7. f18 = float_add(f11, f17)
   8. i20 = int_add_ovf(i, 1)
   9.   guard_no_overflow(, descr=Guard3) #
  10. i23 = getfield_raw(149604768, descr=SignedFieldDescr 0)
  11. i25 = int_add(i23, 1)
  12. setfield_raw(149604768, i25, descr=SignedFieldDescr 0)
  13. i28 = int_and(i25, -2131755008)
  14. i29 = int_is_true(i28)
  15.   guard_false(i29, descr=Guard4)
  16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32),
           ConstPtr(ptr33), p8, p9, p10, f18, i20)

 Do these operations more or less correspond to assembler
 instructions? I guess that the extra overhead here as compared to
 the C version would be lines 4, 5, 9 and 10-15. What are lines 10-15
 all about? I guess that most of these additional operations would not
 affect the performance of more complicated loops, as they will only
 occur once per loop (although combining the guard on line 9 with line 3
 might be a possible optimization)? Line 4 will appear once for each
 array used in the loop and line 5 once for every array access, right?

 Can the array implementation be designed in some way that would not
 generate line 5 above? Or would it be possible to get rid of it by
 some optimization?

 --
 Håkan Ardö

 ___
 pypy-dev@codespeak.net
 http://codespeak.net/mailman/listinfo/pypy-dev


In addition to the things you noted, I guess the int overflow check
can be optimized out, since i+=1 can never cause it to overflow given
that i is bounded at 640*480.  I suppose in general that would require
more dataflow analysis.

Alex

-- 
I disapprove of what you say, but I will defend to the death your
right to say it. -- Voltaire
The people's good is the highest law. -- Cicero
Code can always be simpler than you think, but never as simple as you
want -- Me
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev

Re: [pypy-dev] Interpreter level array implementation

2010-07-02 Thread Maciej Fijalkowski
General note - we consider 2x optimized C a pretty good result :) Details below

On Fri, Jul 2, 2010 at 1:59 PM, Hakan Ardo ha...@debian.org wrote:
 Hi,
 we got the simplest possible interpreter level implementation of an
 array-like object running (in the interplevel-array branch) and it
 executes my previous example about 2 times slower than optimized C.
 Attached is the trace generated by the following example:

    img=array(640*480);   l=0;   i=0;
    while i < 640*480:
        l+=img[i]
        i+=1

 a simplified version of that trace is:

   1. [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i]
   2. i14 = int_lt(i, 307200)
   3.   guard_true(i14, descr=Guard1)
   4.   guard_nonnull_class(p10, 145745952, descr=Guard2)
   5. img = getfield_gc(p10, descr=GcPtrFieldDescr 8)
   6. f17 = getarrayitem_gc(img, i, descr=FloatArrayDescr)
   7. f18 = float_add(f11, f17)
   8. i20 = int_add_ovf(i, 1)
   9.   guard_no_overflow(, descr=Guard3) #
  10. i23 = getfield_raw(149604768, descr=SignedFieldDescr 0)
  11. i25 = int_add(i23, 1)
  12. setfield_raw(149604768, i25, descr=SignedFieldDescr 0)
  13. i28 = int_and(i25, -2131755008)
  14. i29 = int_is_true(i28)
  15.   guard_false(i29, descr=Guard4)
  16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32),
           ConstPtr(ptr33), p8, p9, p10, f18, i20)

 Do these operations more or less correspond to assembler
 instructions?

Yes. Use PYPYJITLOG=log pypy-c ... to get assembler. View using
pypy/jit/backend/x86/tool/viewcode.py

  I guess that the extra overhead here as compared to
 the C version would be lines 4, 5, 9 and 10-15. What are lines 10-15 all about?

It's about a couple of things that the Python interpreter has to perform,
notably asynchronous signal checking and thread switching with the GIL.
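
In other words, lines 10-15 look like the trace-level form of a global
tick counter: bump it once per iteration, and when the masked bits
become non-zero, leave the fast loop and run the periodic checks. A
rough interpreter-level sketch of that idea (the helpers are
placeholders and the reset policy is invented; only the mask constant
is taken from line 13 of the trace):

    ACTION_MASK = -2131755008     # the constant from trace line 13
                                  # (0x80F00000 as a signed 32-bit value)

    def check_for_pending_signals():
        pass                      # placeholder: would run Python-level signal handlers

    def maybe_switch_threads():
        pass                      # placeholder: would let another thread take the GIL

    class TickCounter(object):
        def __init__(self):
            self.value = 0        # the raw word read/written by getfield_raw/setfield_raw

        def tick(self):
            self.value += 1                      # lines 10-12: load, add 1, store
            if self.value & ACTION_MASK:         # lines 13-14: mask and test
                self.periodic_checks()           # line 15: guard_false fails -> slow path

        def periodic_checks(self):
            check_for_pending_signals()
            maybe_switch_threads()
            self.value = 0                       # invented reset; the real policy differs

    ticks = TickCounter()
    for _ in range(2 ** 20 + 1):                 # bit 20 is the lowest bit in the mask
        ticks.tick()                             # the slow path triggers once on the way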

 I guess that most of these additional operations would not affect the
 performance of more complicated loops, as they will only occur once per
 loop (although combining the guard on line 9 with line 3 might be a
 possible optimization)? Line 4 will appear once for each array used in
 the loop and line 5 once for every array access, right?

Yes. We don't do loop-invariant optimizations, for a few reasons, the
best of them being the fact that you can always add a bridge to a loop
which will invalidate the invariant.


 Can the array implementation be designed in some way that would not
 generate line 5 above? Or would it be possible to get rid of it by
 some optimization?

No, it's about optimizations of the JIT itself (it's an artifact of Python
looping rather than of the array module).
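
For completeness, here is a toy illustration of what hoisting line 5
would mean (invented op tuples and a deliberately naive rule, not
PyPy's optimizer): operations that neither read a loop-varying value
nor have side effects are moved into a one-time preamble, and it is
exactly this kind of hoisted read that a later bridge writing to p10
could invalidate.

    def hoist_invariants(ops, loop_variant):
        """ops: list of (result, opname, args); loop_variant: names that change
        on every iteration.  Returns (preamble, loop_body)."""
        variant = set(loop_variant)
        preamble, body = [], []
        for result, op, args in ops:
            reads_variant = any(a in variant for a in args)
            has_side_effect = op.startswith(("guard_", "set", "call", "jump"))
            if reads_variant or has_side_effect:
                if result is not None:
                    variant.add(result)               # its result varies too
                body.append((result, op, args))
            else:
                preamble.append((result, op, args))   # run once before the loop
        return preamble, body

    ops = [
        ("i14", "int_lt",          ("i", 307200)),
        (None,  "guard_true",      ("i14",)),
        ("img", "getfield_gc",     ("p10",)),      # line 5: candidate for hoisting
        ("f17", "getarrayitem_gc", ("img", "i")),
        ("f18", "float_add",       ("f11", "f17")),
    ]
    preamble, body = hoist_invariants(ops, loop_variant={"i", "f11"})
    print(preamble)   # [('img', 'getfield_gc', ('p10',))]

In reality a getfield_gc is only safe to hoist if no path, including a
bridge attached later, can write to that field, which is the point made
above.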


 --
 Håkan Ardö

 ___
 pypy-dev@codespeak.net
 http://codespeak.net/mailman/listinfo/pypy-dev


Cheers,
fijal
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev