Re: [pypy-dev] Interpreter level array implementation
Hi Carl, first, thanks for reading and for your explanation.

On Sat, Jul 3, 2010 at 10:03, Carl Friedrich Bolz cfb...@gmx.de wrote:
> 2010/7/3 Paolo Giarrusso p.giarru...@gmail.com:
>>> It requires thinking. It's harder to do because we don't know
>>> statically upfront how many paths we'll compile to assembler, but I
>>> can think about ways to mitigate that.
>> Isn't there some existing research about that in the 'tracing'
>> community? As far as I remember, the theory is that traces are
>> assembled in trace trees, and each time a (simplified*) SSA
>> optimization pass is applied to the trace tree to compile it. Not sure
>> whether they do it also for Javascript, since there compilation times
>> have to be very fast, but I guess they did so in their Java compiler.
> There are two ways to deal with attaching new traces to existing ones.
> On the one hand there are trace trees, which recompile the whole tree
> of traces when a new one is added. This can be costly. On the other
> hand, there is trace stitching, which just patches the existing trace
> to jump to the new one. PyPy (and TraceMonkey, I think) uses trace
> stitching.

For TraceMonkey, response times suggest the use of trace stitching. The original Java compiler used trace trees. But if I have a Python application server, I'm probably willing to accept the bigger compilation time, especially if compilation is performed by a background thread. Would it be possible to accommodate this case?

> The problem with loop-invariant code motion is that when you stitch in
> a new trace (what we call a bridge) it is not clear that the code that
> was invariant so far is invariant on the new path as well.

I see - but what about noting potential modifications to the involved objects and invalidating the old traces, similarly to how class loading invalidates other optimizations? Of course, some heuristics and tuning would be needed, I guess, since I expect that invalidations here would be much more frequent otherwise.
Such heuristics would probably approximate a solution to the problem mentioned by Maciej:
> It requires thinking. It's harder to do because we don't know
> statically upfront how many paths we'll compile to assembler, but I
> can think about ways to mitigate that.
However, I still wonder how easy it is to recognize a potential write.

-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
___
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev
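The trace-stitching vs. trace-tree distinction discussed above can be sketched as a toy model. All names here are illustrative assumptions, not PyPy internals:

```python
# Toy model contrasting the two ways of attaching a new trace at a
# failing guard. compile_trace stands in for emitting machine code.

def compile_trace(ops):
    # Stand-in for actually compiling a list of operations.
    return ("code", tuple(ops))

def attach_by_stitching(guard, new_ops):
    # Trace stitching: compile only the new trace (the bridge) and
    # patch the failing guard to jump to it. Cheap per attachment, but
    # optimizations already baked into the old trace (e.g. hoisted
    # invariants) are not revisited for the new path.
    guard["bridge"] = compile_trace(new_ops)

def attach_by_trace_tree(tree, new_ops):
    # Trace trees: add the new path and recompile every path in the
    # tree together, so an optimization pass can see all paths at
    # once. More costly per attachment.
    tree["paths"].append(new_ops)
    tree["code"] = compile_trace(
        [op for path in tree["paths"] for op in path])

guard = {"bridge": None}
attach_by_stitching(guard, ["int_add", "jump"])

tree = {"paths": [["int_lt", "guard_true"]], "code": None}
attach_by_trace_tree(tree, ["int_add", "jump"])
```

The trade-off Paolo raises is visible here: stitching touches only one guard, while the tree variant recompiles everything it has seen so far.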
Re: [pypy-dev] Interpreter level array implementation
On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski fij...@gmail.com wrote:
> General note - we consider 2x optimized C a pretty good result :)
> Details below

As do I :) I just want to make this as jit-friendly as possible without really knowing what's jit-friendly...

> Yes. We don't do loop invariant optimizations for some reasons, the
> best of them being the fact that to a loop you can always add a bridge
> which will invalidate this invariant.

Are you telling me that you probably never will include that kind of optimization because of the limitations it imposes on other parts of the jit, or just that it would be a lot of work to get it in place? What is a bridge?

-- 
Håkan Ardö
Re: [pypy-dev] Interpreter level array implementation
On Sat, Jul 3, 2010 at 12:14 AM, Hakan Ardo ha...@debian.org wrote:
> On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski fij...@gmail.com wrote:
>> General note - we consider 2x optimized C a pretty good result :)
>> Details below
> As do I :) I just want to make this as jit-friendly as possible
> without really knowing what's jit-friendly...

I think it's fairly JIT friendly. You can look into traces (as you did), but it seems fine to me.

>> Yes. We don't do loop invariant optimizations for some reasons, the
>> best of them being the fact that to a loop you can always add a bridge
>> which will invalidate this invariant.
> Are you telling me that you probably never will include that kind of
> optimization because of the limitations it imposes on other parts of
> the jit, or just that it would be a lot of work to get it in place?

It requires thinking. It's harder to do because we don't know statically upfront how many paths we'll compile to assembler, but I can think about ways to mitigate that.

> What is a bridge?

If a guard fails often enough, it is traced and compiled to assembler. That's a bridge.
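The bridge mechanism just described can be sketched as follows. The threshold and all names are made up for illustration; PyPy's real machinery differs:

```python
# Hedged sketch: count how often a guard fails, and once it fails
# "often enough", compile the failure path and attach the result to
# the guard, so later failures jump to compiled code instead of
# falling back to the interpreter.

BRIDGE_THRESHOLD = 3  # assumed hot-failure threshold

class Guard:
    def __init__(self, condition):
        self.condition = condition
        self.failures = 0
        self.bridge = None  # compiled code for the failure path

    def check(self, value, compile_bridge):
        if self.condition(value):
            return True          # fast path: stay in the compiled loop
        self.failures += 1
        if self.bridge is None and self.failures >= BRIDGE_THRESHOLD:
            # The failure path is hot: trace and compile a bridge.
            self.bridge = compile_bridge()
        return False

g = Guard(lambda x: x < 10)
for v in (11, 12, 13):
    g.check(v, lambda: "compiled-bridge")
```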
Re: [pypy-dev] Interpreter level array implementation
On Sat, Jul 3, 2010 at 08:20, Maciej Fijalkowski fij...@gmail.com wrote:
> On Sat, Jul 3, 2010 at 12:14 AM, Hakan Ardo ha...@debian.org wrote:
>> On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski fij...@gmail.com wrote:
>>> General note - we consider 2x optimized C a pretty good result :)
>>> Details below
>> As do I :) I just want to make this as jit-friendly as possible
>> without really knowing what's jit-friendly...
> I think it's fairly JIT friendly. You can look into traces (as you
> did), but it seems fine to me.
>>> Yes. We don't do loop invariant optimizations for some reasons, the
>>> best of them being the fact that to a loop you can always add a bridge
>>> which will invalidate this invariant.
>> Are you telling me that you probably never will include that kind of
>> optimization because of the limitations it imposes on other parts of
>> the jit, or just that it would be a lot of work to get it in place?
> It requires thinking. It's harder to do because we don't know
> statically upfront how many paths we'll compile to assembler, but I
> can think about ways to mitigate that.

Isn't there some existing research about that in the 'tracing' community? As far as I remember, the theory is that traces are assembled in trace trees, and each time a (simplified*) SSA optimization pass is applied to the trace tree to compile it. Not sure whether they do it also for Javascript, since there compilation times have to be very fast, but I guess they did so in their Java compiler.

Also, in other cases the general JIT approach is 'optimize and invalidate if needed'. For instance, if a Java class has no subclass, it's not safe to assume this will hold forever when performing an optimization; but the optimization is performed anyway, and a hook is installed so that class loading will undo it.

Another issue: what is i4 for? It's not used at all in the loop, but it is reset to 27 at the end of it, each time. Doesn't such a var waste some (little) time?
* SSA on trace trees took advantage of their simpler structure compared to graphs for some operations.

-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
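The 'optimize and invalidate if needed' pattern mentioned above can be sketched as a small registry. The names are hypothetical; no particular VM spells it this way:

```python
# Sketch: compiled code records the assumptions it depends on (e.g.
# "class C has no subclass"); breaking an assumption -- say, by
# loading a subclass -- discards the dependent code, which must then
# be re-traced without the optimization.

class AssumptionRegistry:
    def __init__(self):
        self.dependents = {}  # assumption -> compiled traces relying on it

    def compile_under(self, assumption, ops):
        # Compile ops optimized under the given assumption and remember
        # the dependency.
        code = ("code", tuple(ops))
        self.dependents.setdefault(assumption, []).append(code)
        return code

    def invalidate(self, assumption):
        # Called when the assumption is broken (the class-loading hook
        # in the Java example); returns the traces to throw away.
        return self.dependents.pop(assumption, [])

reg = AssumptionRegistry()
fast = reg.compile_under("C-has-no-subclass", ["inlined_call"])
dropped = reg.invalidate("C-has-no-subclass")  # e.g. a subclass was loaded
```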
Re: [pypy-dev] Interpreter level array implementation
Hi Alex,

On Fri, Jul 02, 2010 at 03:12:19PM -0500, Alex Gaynor wrote:
> In addition to the things you noted, I guess the int overflow check
> can be optimized out, since i+=1 can never cause it to overflow given
> that i is bounded at 640*480. I suppose in general that would require
> more dataflow analysis.

Hakan mentioned this. It's actually an easy optimization in our linear code; I guess I will give it a try.

A bientot,

Armin.
Re: [pypy-dev] Interpreter level array implementation
Hi Paolo,

On Sat, Jul 03, 2010 at 08:58:34AM +0200, Paolo Giarrusso wrote:
> Isn't there some existing research about that in the 'tracing'
> community? (...) Not sure whether they do it also for Javascript,
> since there compilation times have to be very fast, but I guess they
> did so in their Java compiler.

We are not very good at mentioning existing research, but at least for the case of tracing JITs I think we know pretty much everything published, which you might find by googling for tracing JIT. (It's always a better approach than making guesses on an unrelated project's mailing list.)

For how PyPy's tracing JIT compares to existing approaches, there is a PyPy paper at:

    http://codespeak.net/svn/pypy/extradoc/talk/icooolps2009/

as well as the start of a draft about virtuals at:

    http://codespeak.net/svn/pypy/extradoc/talk/s3-2010/

And you should not miss Benjamin's great summary at:

    http://codespeak.net/pypy/dist/pypy/doc/jit/pyjitpl5.html

A bientot,

Armin.
Re: [pypy-dev] Interpreter level array implementation
On 03/07/10 08:14, Hakan Ardo wrote:
> What is a bridge?

You might be interested to read the chapter of my PhD thesis which explains exactly that, with diagrams:

    http://codespeak.net/svn/user/antocuni/phd/thesis/thesis.pdf

In particular, section 6.4 explains the difference between loops, bridges and entry bridges.

ciao,
Anto
Re: [pypy-dev] Interpreter level array implementation
On Jul 3, 2010, at 3:58 AM, Paolo Giarrusso wrote:
> Another issue: what is i4 for? It's not used at all in the loop, but
> it is reset to 27 at the end of it, each time. Doesn't such a var
> waste some (little) time?

This I found interesting. Does anyone know the answer?

-- 
Leonardo Santagada
santagada at gmail.com
Re: [pypy-dev] Interpreter level array implementation
On Sat, Jul 3, 2010 at 09:28, Armin Rigo ar...@tunes.org wrote:
> On Sat, Jul 03, 2010 at 08:58:34AM +0200, Paolo Giarrusso wrote:
>> Isn't there some existing research about that in the 'tracing'
>> community? (...) Not sure whether they do it also for Javascript,
>> since there compilation times have to be very fast, but I guess they
>> did so in their Java compiler.
> We are not very good at mentioning existing research, but at least for
> the case of tracing JITs I think we know pretty much everything
> published, which you might find by googling for tracing JIT. (It's
> always a better approach than making guesses on an unrelated project's
> mailing list.)

If you had read the next sentence you'd have found out that I did read some papers about that (where I learned about trace trees). My guess was just about whether their Java compiler used trace trees or the other possibility, i.e., trace stitching (as I now learned). But thanks for the references, I'll have a look later.

-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
[pypy-dev] Interpreter level array implementation
Hi, we got the simplest possible interpreter level implementation of an array-like object running (in the interplevel-array branch) and it executes my previous example about 2 times slower than optimized C. Attached is the trace generated by the following example:

    img=array(640*480); l=0; i=0
    while i<640*480:
        l+=img[i]
        i+=1

A simplified version of that trace is:

    1.  [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i]
    2.  i14 = int_lt(i, 307200)
    3.  guard_true(i14, descr=Guard1)
    4.  guard_nonnull_class(p10, 145745952, descr=Guard2)
    5.  img = getfield_gc(p10, descr=GcPtrFieldDescr 8)
    6.  f17 = getarrayitem_gc(img, i, descr=FloatArrayDescr)
    7.  f18 = float_add(f11, f17)
    8.  i20 = int_add_ovf(i, 1)
    9.  guard_no_overflow(, descr=Guard3)
    10. i23 = getfield_raw(149604768, descr=SignedFieldDescr 0)
    11. i25 = int_add(i23, 1)
    12. setfield_raw(149604768, i25, descr=SignedFieldDescr 0)
    13. i28 = int_and(i25, -2131755008)
    14. i29 = int_is_true(i28)
    15. guard_false(i29, descr=Guard4)
    16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32), ConstPtr(ptr33), p8, p9, p10, f18, i20)

Do these operations more or less correspond to assembler instructions? I guess that the extra overhead here as compared to the C version would be lines 4, 5, 9 and 10-15. What's 10-15 all about?

I guess that most of these additional operations would not affect the performance of more complicated loops as they will only occur once per loop (although combining the guard on line 9 with line 3 might be a possible optimization)? Line 4 will appear once for each array used in the loop and line 5 once for every array access, right? Can the array implementation be designed in some way that would not generate line 5 above? Or would it be possible to get rid of it by some optimization?

-- 
Håkan Ardö
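For readers following along, the simplified trace can be rendered as plain Python, with the guard/descr machinery collapsed into ordinary control flow (names are illustrative):

```python
# Each while-loop iteration below corresponds to one pass through the
# compiled trace; the comments map statements back to trace lines.

def run(img):
    l = 0.0
    i = 0
    while True:
        if not (i < len(img)):   # lines 2-3: int_lt + guard_true
            break
        # line 4: guard_nonnull_class -- img is still the expected class
        storage = img            # line 5: getfield_gc loads the storage field
        l = l + storage[i]       # lines 6-7: getarrayitem_gc + float_add
        i = i + 1                # lines 8-9: int_add_ovf + guard_no_overflow
        # lines 10-15: global counter bookkeeping, elided here
    return l

total = run([1.0, 2.0, 3.0])
```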
Re: [pypy-dev] Interpreter level array implementation
On Fri, Jul 2, 2010 at 2:59 PM, Hakan Ardo ha...@debian.org wrote:
> Hi, we got the simplest possible interpreter level implementation of
> an array-like object running (in the interplevel-array branch) and it
> executes my previous example about 2 times slower than optimized C.
> Attached is the trace generated by the following example:
>
>     img=array(640*480); l=0; i=0
>     while i<640*480:
>         l+=img[i]
>         i+=1
>
> a simplified version of that trace is:
>
>     1.  [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i]
>     2.  i14 = int_lt(i, 307200)
>     3.  guard_true(i14, descr=Guard1)
>     4.  guard_nonnull_class(p10, 145745952, descr=Guard2)
>     5.  img = getfield_gc(p10, descr=GcPtrFieldDescr 8)
>     6.  f17 = getarrayitem_gc(img, i, descr=FloatArrayDescr)
>     7.  f18 = float_add(f11, f17)
>     8.  i20 = int_add_ovf(i, 1)
>     9.  guard_no_overflow(, descr=Guard3)
>     10. i23 = getfield_raw(149604768, descr=SignedFieldDescr 0)
>     11. i25 = int_add(i23, 1)
>     12. setfield_raw(149604768, i25, descr=SignedFieldDescr 0)
>     13. i28 = int_and(i25, -2131755008)
>     14. i29 = int_is_true(i28)
>     15. guard_false(i29, descr=Guard4)
>     16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32), ConstPtr(ptr33), p8, p9, p10, f18, i20)
>
> Do these operations more or less correspond to assembler instructions?
> I guess that the extra overhead here as compared to the C version
> would be lines 4, 5, 9 and 10-15. What's 10-15 all about? I guess that
> most of these additional operations would not affect the performance
> of more complicated loops as they will only occur once per loop
> (although combining the guard on line 9 with line 3 might be a
> possible optimization)? Line 4 will appear once for each array used in
> the loop and line 5 once for every array access, right? Can the array
> implementation be designed in some way that would not generate line 5
> above? Or would it be possible to get rid of it by some optimization?
In addition to the things you noted, I guess the int overflow check can be optimized out, since i+=1 can never cause it to overflow given that i is bounded at 640*480. I suppose in general that would require more dataflow analysis.

Alex

-- 
I disapprove of what you say, but I will defend to the death your right to say it. -- Voltaire
The people's good is the highest law. -- Cicero
Code can always be simpler than you think, but never as simple as you want -- Me
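The bounds reasoning behind this remark can be sketched in a few lines. The function name and word-size handling are assumptions for illustration, not the JIT's actual pass:

```python
# The loop guard (lines 2-3 of the trace) already proves i < 307200,
# so i + 1 cannot leave the signed machine-word range, making the
# int_add_ovf / guard_no_overflow pair redundant.
import sys

def overflow_check_needed(known_upper_bound, increment):
    # True only if the bounded value plus the increment could exceed
    # the largest signed word.
    return known_upper_bound + increment > sys.maxsize

needed = overflow_check_needed(640 * 480, 1)  # i is bounded by the loop guard
```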
Re: [pypy-dev] Interpreter level array implementation
General note - we consider 2x optimized C a pretty good result :) Details below

On Fri, Jul 2, 2010 at 1:59 PM, Hakan Ardo ha...@debian.org wrote:
> Hi, we got the simplest possible interpreter level implementation of
> an array-like object running (in the interplevel-array branch) and it
> executes my previous example about 2 times slower than optimized C.
> Attached is the trace generated by the following example:
>
>     img=array(640*480); l=0; i=0
>     while i<640*480:
>         l+=img[i]
>         i+=1
>
> a simplified version of that trace is:
>
>     1.  [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i]
>     2.  i14 = int_lt(i, 307200)
>     3.  guard_true(i14, descr=Guard1)
>     4.  guard_nonnull_class(p10, 145745952, descr=Guard2)
>     5.  img = getfield_gc(p10, descr=GcPtrFieldDescr 8)
>     6.  f17 = getarrayitem_gc(img, i, descr=FloatArrayDescr)
>     7.  f18 = float_add(f11, f17)
>     8.  i20 = int_add_ovf(i, 1)
>     9.  guard_no_overflow(, descr=Guard3)
>     10. i23 = getfield_raw(149604768, descr=SignedFieldDescr 0)
>     11. i25 = int_add(i23, 1)
>     12. setfield_raw(149604768, i25, descr=SignedFieldDescr 0)
>     13. i28 = int_and(i25, -2131755008)
>     14. i29 = int_is_true(i28)
>     15. guard_false(i29, descr=Guard4)
>     16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32), ConstPtr(ptr33), p8, p9, p10, f18, i20)
>
> Do these operations more or less correspond to assembler instructions?

Yes. Use PYPYJITLOG=log pypy-c ... to get the assembler. View it using pypy/jit/backend/x86/tool/viewcode.py

> I guess that the extra overhead here as compared to the C version
> would be lines 4, 5, 9 and 10-15. What's 10-15 all about?

It's about a couple of things that the python interpreter has to perform, notably asynchronous signal checking and thread swapping with the GIL.

> I guess that most of these additional operations would not affect the
> performance of more complicated loops as they will only occur once per
> loop (although combining the guard on line 9 with line 3 might be a
> possible optimization)?
> Line 4 will appear once for each array used in the loop and line 5
> once for every array access, right?

Yes. We don't do loop invariant optimizations for some reasons, the best of them being the fact that to a loop you can always add a bridge which will invalidate this invariant.

> Can the array implementation be designed in some way that would not
> generate line 5 above? Or would it be possible to get rid of it by
> some optimization?

No, it's about optimizations of the JIT itself (it's an artifact of python looping rather than the array module).

Cheers,
fijal
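A hedged sketch of what trace lines 10-15 appear to do, per the answer above: bump a global tick counter and test it against a bit mask, taking a slow path when a masked bit becomes set. The mask constant is taken from the posted trace; the handler is a hypothetical stub:

```python
# Lines 10-12 increment a global counter; lines 13-15 AND it with a
# mask and guard that the result is zero. When the guard fails, the
# interpreter handles pending signals / thread switching with the GIL.

MASK = -2131755008          # from: i28 = int_and(i25, -2131755008)

ticks = 0
slow_path_hits = 0

def handle_signals_and_gil():      # hypothetical slow path
    global slow_path_hits
    slow_path_hits += 1

def bytecode_tick():
    global ticks
    ticks += 1                     # lines 10-12: load, add, store
    if ticks & MASK:               # lines 13-14: int_and + int_is_true
        handle_signals_and_gil()   # line 15: guard_false fails -> slow path

for _ in range(5):
    bytecode_tick()
```

Since the mask's low bits are zero, the check stays on the fast path for a long run of iterations and only rarely triggers the slow path.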
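The loop-invariant code motion discussed in this thread can be illustrated by hand: line 5's getfield_gc reloads the array's storage field every iteration even though nothing in the loop writes it, so it could in principle be loaded once before the loop. Class and function names here are illustrative, not the JIT's own transformation:

```python
class Array:
    def __init__(self, storage):
        self.storage = storage      # the field getfield_gc would load

def sum_naive(arr):
    l = 0.0
    i = 0
    while i < len(arr.storage):
        storage = arr.storage       # invariant load, repeated each iteration
        l += storage[i]
        i += 1
    return l

def sum_hoisted(arr):
    storage = arr.storage           # hoisted: loaded once, before the loop
    l = 0.0
    i = 0
    while i < len(storage):
        l += storage[i]
        i += 1
    return l

a = Array([0.5] * 4)
```

The catch fijal describes is that a bridge stitched in later might invalidate the hoisted load, e.g. by replacing the storage field mid-loop.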