Hello Skip,

On Mon, 22 Mar 2021 17:13:19 -0500
Skip Montanaro <skip.montan...@gmail.com> wrote:
> Thanks for the response. I will try to address your comments inline.
>
> > I guess it should be a good idea to answer what's the scope of this
> > project - is it a research one or a "production" one? If it's a
> > research one, why be concerned with the churn of over-modern
> > CPython versions? Wouldn't it be better to just use some scalable,
> > incremental implementation which would allow forward-porting it to
> > a newer version, if it ever comes to that?
>
> The motivation for revisiting this idea was/is largely personal.

Thanks for putting it like that; that's the impression I also got (and
I've been "familiar" with (i.e., heard about) this project for a few
years) ;-). But you're now looking for potential contributors, so it
may be a good idea to explain better in the "pep" what would motivate
them to join in.

> As I indicated, I first messed around with it over 20 years ago and
> it's been in the back of my mind ever since. Somehow I never lost the
> code despite I'm not sure how many computers came and went and that
> the code was never uploaded to any sort of distributed version
> control system. I decided to pick things up again as a way to mostly
> keep my head in the game after I retired. So, neither "research" nor
> "production" seems to be a correct descriptor.

I guess "research" fits well enough, unless you really want to
categorize it as a "personal quest". But then it's unclear how a
request for contributors fits in with that.

> Still, if taken to functional completion — functional enough for
> performance testing and application to more than just toy scripts —
> I realized pretty quickly that I'd need help.
>
> > Otherwise, if it's "production", who's the "customer" and how do
> > they "compensate" you for doing work (chasing the moving target)
> > which is clearly of little interest to you and conflicts with the
> > goal of the project?
>
> Nobody is compensating me. I have no desire to try and turn it into
> something I do for hire. Maybe I misunderstood your question?

Yes, what I'm essentially trying to hint at is that if:

a) you're working against the CPython bleeding edge,
b) you find that the bleeding edge adds extra chores, and
c) nobody told you to work on the bleeding edge (neither a boss, nor a
   maintainer who said "I'll merge it once you're done"),

then why complicate your task by working on the bleeding edge? You
could take a not-too-old CPython version, e.g. 3.8/3.9, instead, and
work with that.

> > > This PEP proposes the addition of register-based instructions to
> > > the existing Python virtual machine, with the intent that they
> > > eventually replace the existing stack-based opcodes.
> >
> > Sorry, what? The purpose of register-based instructions is just to
> > replace stack-based instructions? That's not what I'd like to hear
> > as the intro phrase. You probably want to replace one with the
> > other because register-based ones offer some benefit, faster
> > execution perhaps? That's what I'd like to hear instead of
> > "deciphering" it from between the lines.
>
> Replacing stack-based instructions would be a reasonable initial
> goal, I think. Victor reported performance improvements in his
> implementation (also a translator).

Btw, from just the "pep" it's unclear if, and how much, you reuse
Victor's work. If not, why not? (The answer is useful to contributors -
you're asking them to "reuse" your code, so how does it stand with your
reuse of the code of folks who came before you?)
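For readers of the list who haven't seen a register VM before, here is
roughly the kind of translation being discussed. The stack listing is
what CPython's dis module shows for a trivial function (exact output
varies a bit by version); the register form is purely hypothetical -
opcode names like BINARY_ADD_REG are just illustrative, not claims
about your or Victor's actual instruction sets.

    import dis

    def f(a, b):
        c = a + b
        return c

    dis.dis(f)
    # Current stack-based code, roughly (CPython 3.9):
    #   LOAD_FAST    a       # push local a
    #   LOAD_FAST    b       # push local b
    #   BINARY_ADD           # pop two values, push their sum
    #   STORE_FAST   c       # pop the sum into local c
    #   LOAD_FAST    c       # push local c again
    #   RETURN_VALUE         # pop and return it
    #
    # A hypothetical register form, addressing locals in place:
    #   BINARY_ADD_REG   c, a, b   # c = a + b, no stack traffic
    #   RETURN_VALUE_REG c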
> As I indicated in the "PEP" (I use that term rather loosely, as I
> have no plans at the moment to submit it for consideration, certainly
> not in its current, incomplete state),

Sure, I understand. If anything, I see big similarities between your
situation and mine - I also have some ideas (different ones) for
changes (improvements, I think) to the Python runtime (not just the
VM), and I also write "pseudo-PEPs" about them to seek feedback, etc.

> a better ultimate way to go would be to generate register
> instructions directly from the AST.

Yeah, potentially a better ultimate way. Note, though, that advanced
(JIT, etc.) Java VMs do well starting from standard stack-based Java
bytecode.

[]

> > explicitly. Would be interesting to read (in the following "pep"
> > sections) what makes them "almost completely distinct".
>
> Well, sure. The main difference is the way a pair of instructions
> (say, BINARY_ADD vs BINARY_ADD_REG) get their operands and save their
> result. You still have to be able to add two objects, call functions,
> etc.

I'd note that your reply skips the question about the calling
convention for a register-based VM, and that's again one of the most
important questions (and one whose answer I'd be genuinely interested
to hear).

[]

> The fact that I treat the current frame's stack space as registers
> makes it pretty much impossible to execute both stack and register
> instructions within the same frame.

I don't see how that would be true in general (I understand that your
implementation may have constraints like that, but that's exactly why I
bring it up - why does it have such constraints?). Even the existing
Python VM mixes both in the same frame: LOAD_FAST, for example, takes
the value of a "register" and puts it on the stack. Do you mean details
like having to translate a stack-based instruction into 2 (or more)
instructions:

a) the actual register-register instruction, and
b) a stack pointer adjustment, so the remaining stack-based
   instructions keep working?

Yes, you would need to do that until you fully switch to register-based
instructions. But then there are 2 separate tasks:

1. Make the register VM work. (Should be of medium complexity.)
2. Make it fast. (Likely will be hard.)

If you want to achieve both right from the start - oh-oh, that may be
double-hard.

> Victor's implementation did things differently in this regard. I
> believe he just allocated extra space for 256 registers at the end of
> each frame, so (in theory, I suppose), you could have instructions
> from both executed in the same frame.

I hope you have a plan for how to deal with more than 256 registers,
etc. A register VM adds a lot of accidental implementation complexity
;-).
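To make the "LOAD_FAST already bridges both worlds" point concrete,
here is a toy model, in plain Python, of how the two styles could
coexist in one frame. It's only a sketch of the general idea (the
MOVE_REG name is invented); none of it is taken from your code or
Victor's.

    # Toy model of one frame's value area: local slots plus stack space.
    fastlocals = [None] * 3     # the "registers": slots for locals a, b, c
    stack = []                  # the evaluation stack of the same frame

    # Stack VM: LOAD_FAST copies a "register" (a fastlocals slot) onto
    # the stack.
    def load_fast(oparg):
        stack.append(fastlocals[oparg])

    # Register VM: a hypothetical MOVE_REG copies slot to slot, with no
    # stack traffic at all.
    def move_reg(dst, src):
        fastlocals[dst] = fastlocals[src]

As long as the stack depth is tracked consistently, stack and register
instructions can interleave in the same frame, which is what makes a
gradual "make it work first, make it fast later" migration conceivable.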
> > > ## Motivation
> >
> > I'm not sure the content of the section corresponds much to its
> > title. It jumps from a background survey of the different Python VM
> > optimizations to (some) implementation details of the register VM -
> > leaving "motivation" somewhere "between the lines".
> >
> > > Despite all that effort, opcodes which do nothing more than move
> > > data onto or off of the stack (LOAD_FAST, LOAD_GLOBAL, etc) still
> > > account for nearly half of all opcodes executed.
> >
> > ... And - you intend to change that with a register VM? In which
> > way and how? As an example, LOAD_GLOBAL isn't going anywhere - it
> > loads a variable by *symbolic* name into a register.
>
> Certainly, if you have data which isn't already on the stack, you are
> going to have to move data.

Even if you have data in registers, you may still need to move it
around to accommodate the special conventions of some instructions.

> As the appendix shows though, a fairly large chunk of the current
> virtual machine does nothing more than manipulate the stack
> (LOAD_FAST, STORE_FAST, POP_TOP, etc).
>
> > > Running Pyperformance using a development version of Python 3.9
> > > showed that the five most frequently executed pure stack opcodes
> > > (LOAD_FAST, STORE_FAST, POP_TOP, DUP_TOP and ROT_TWO) accounted
> > > for 35% of all executed instructions.
> >
> > And you intend to change that with a register VM? How?
>
> I modified the frame so that (once again) the current local variable
> space adjoins the current stack space (which, again, I treat as
> registers). The virtual machine can thus access local variables in
> place.

For simple instructions, yes. But what about instructions with multiple
(arbitrarily many) arguments, like CALL_FUNCTION*/CALL_METHOD*,
BUILD_*, etc.? (That's the same question I asked in the first mail, and
it isn't answered here.)

> In retrospect, I suspect it might not have been necessary. During the
> current phase where I've yet to implement any *_DEREF_REG
> instructions, it would be a moot point though. Still, I'm not sure
> the cell/free slots have the same semantics as locals/stack (an area
> of my expertise which is lacking). Isn't there an extra level of
> indirection there? In any case, if the cell/free slots are
> semantically just like locals, then it would be straightforward for
> me to restore the order of the data blocks in frames.

Yes, that's the point - semantically they're just locals, even though
they are usually accessed with an extra level of indirection.

> > A quick google search leads to
> > https://www.strchr.com/x86_machine_code_statistics (yeah, that's
> > not a VM, it's an RM (real machine); stats over different VMs would
> > definitely be welcome):
> >
> > > The most popular instruction is MOV (35% of all instructions).
> >
> > So, is the plan to replace the 35% of "five most frequently
> > executed pure stack opcodes" with 35% of register-register move
> > instructions? If not, why would it be different and how would you
> > achieve that?
>
> I have clearly not explained myself very well in the "PEP".

Well, it seems to be written on the assumption that the reader is
already familiar with the benefits of register-based VMs. As a fresh
reader, I tried to point that out. I also happen to be familiar with
those benefits, and with the fact that "on average" register-based VMs
are faster than stack-based ones. But that only makes me ask why you
think those "on average" benefits would apply to the Python VM's case,
and ask for additional details regarding the more complex cases (which
would IMHO noticeably cancel out any register-VM benefits, unless you
have a cunning, well-grounded plan for dealing with them).

> I will rework that section. Still though, local variables and stack
> space are adjacent in my implementation, so local variables can be
> addressed directly without needing to first copy them onto the stack
> (or into a register).

Only for simple operations. Where the idyllic picture breaks, even for
RISC CPUs with their uniform register files, is function calling (the
Python analog: CALL_FUNCTION). For not-purely-RISC CPUs, a further drag
is instructions with ad hoc register constraints (the Python analog
would be BUILD_LIST, etc.).
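To spell out the CALL_FUNCTION concern: today the callee and its
arguments are simply N+1 consecutive slots at the top of the stack, so
a register VM has to either shuffle arguments into a contiguous
register window first or grow variable-length operands. A purely
hypothetical listing for r = f(x, y, z), with invented *_REG names and
f/x/y/z all assumed to be locals:

    Stack VM today (3.8/3.9 flavour):
        LOAD_FAST      f
        LOAD_FAST      x
        LOAD_FAST      y
        LOAD_FAST      z
        CALL_FUNCTION  3        # callee + args are the top 4 slots
        STORE_FAST     r

    Register VM, option A - keep the "consecutive slots" convention:
        MOVE_REG       r10, f   # shuffle into a contiguous window
        MOVE_REG       r11, x
        MOVE_REG       r12, y
        MOVE_REG       r13, z
        CALL_FUNCTION_REG  r, r10, 3

    Register VM, option B - variable-length operands:
        CALL_FUNCTION_REG  r, f, (x, y, z)

Option A reintroduces exactly the register-register moves the design
hopes to eliminate; option B complicates instruction encoding and
decoding. Either can be made to work, but that's the kind of trade-off
I'd like the "pep" to spell out.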
> Clearly, global variables must still be copied into and out of
> registers to manipulate. I may well replicate the code object's
> constants in the frame as well so they can also be treated as
> (read-only) registers. Since FrameObjects are cached for reuse with
> the same code object, the cost to copy them should be bearable.
>
> > > They are low-cost instructions (compared with CALL_FUNCTION for
> > > example), but still eat up time and space in the virtual machine
> >
> > But that's the problem of any VM - it's slow by definition. There
> > can be less slow and more slow VMs, but VMs can't be fast. So,
> > what's the top-level motivation - is it "making CPython fast" or
> > "making CPython a little bit less slow"? By how much?
>
> The top-level motivation is to have fun. Need there be anything more
> ambitious?

Well, my point is that if you ask for contributors, it's fair to be
fair with them ;-). If you ask them to join the fun, fine, but then it
makes sense to maximize the level of fun and minimize the level of
chore (no chasing of the bleeding edge, minimize the changes overall,
use incremental development where you always have something working,
not "we'll get that working in a year if we pull hard every day").
Otherwise, maybe it would be useful to have a more objective
criterion/goal, like "let's try to make Python faster", and then it
makes sense to explain why you think a register VM would be faster in
general, and in Python's case specifically. (The work process can still
be fun-optimized and chore-minimizing ;-).)

> Still, point taken. Had I continued to pursue this back in the early
> 2000s, or had Victor succeeded in getting his implementation into the
> core, we might be much further along. The current problem is made all
> the more difficult by the fact that the virtual machine has grown so
> much in the past 20+ years.
>
> > > Consider the layout of the data section of a Frame object:
> > > All those LOAD_FAST and STORE_FAST instructions just copy
> > > pointers between chunks of RAM which are just a few bytes away
> > > from each other in memory.
> >
> > Ok, but LOAD_DEREF and STORE_DEREF instructions also just copy
> > pointers (with extra dereferencing, but that's a detail). It's
> > unclear why you try to ignore them (the "cell" registers), putting
> > the "locals" and "stack" registers ahead of them. The actual
> > register instruction implementation would just treat any frame slot
> > as a register with continuous numbering, allowing access to all of
> > the locals, cells, and stack slots in the same way. In that regard,
> > trying to rearrange the 3 groups at this stage seems like rather
> > unneeded implementation complexity with no clear motivation.
>
> I haven't even looked at LOAD_DEREF or STORE_DEREF yet. I think that
> extra dereferencing will be more than a simple detail though. That
> makes the semantics of cell/free slots different than those of
> locals/registers slots (more like globals). If true, then my
> reordering of the frame data is worthwhile, I think.

Until you have benchmarking data proving that on a *specific CPU* one
layout is 0.39% faster than the other (it will be different, and even
the opposite, on another arch/CPU), it's all guesswork, I'm afraid. And
it's extra changed code to maintain, which spoils the fun ;-).
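For reference, that "extra level of indirection" is the cell object,
and it's easy to see in stock CPython (the exact bytecode varies by
version, but the shape of it doesn't):

    import dis

    def make_counter():
        count = 0            # stored in a cell: the inner function closes over it
        def bump():
            nonlocal count   # compiled to LOAD_DEREF/STORE_DEREF, not *_FAST
            count += 1
            return count
        return bump

    bump = make_counter()
    bump()
    print(bump.__closure__[0].cell_contents)  # -> 1; the value sits behind a cell
    dis.dis(bump)   # shows LOAD_DEREF / STORE_DEREF for 'count'

So presumably a *_DEREF_REG instruction would need that one extra hop,
but the slot itself can still be numbered and addressed like any other
register.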
>
> Skip

--
Best regards,
 Paul                          mailto:pmis...@gmail.com