Dave, thanks for the thoughts.
My intent is to use non-custom barriers and measure scalability, customising if need be. Not busy-waiting the goroutines leaves processor time for other stuff (visualization, looking for patterns, etc.). There's also a load-balancing effect: code pretending to be a core fetching an instruction from an icache and waiting for registers to be available has a longer path length than a simple memory access, so one has to avoid the whole set waiting on the longest pole. Simulating vector operations brings similar problems.

The 'batch the instructions' approach is standard practice even in sequential simulators, where the overheads of walking the list, finding the data structure for the core, and fetching an instruction are large enough that letting the simulated core fetch-and-execute 10, 100, or 1000 instructions at a time gives usefully improving performance. This has to be traded off against the fact that multicore computer systems have lots of contention, and if you batch things up you risk losing sight of the details of that contention (which makes the utility of the model iffy). This can be attacked with variable batch sizes, letting other engines catch up, and so on, but it all adds complication (which is unwise). So assuring oneself of good scalability with instruction-at-a-time execution is good due diligence.

If this works as desired, it should end up open source. I'm a novice at writing Go, so the first pass will be as simple as practical rather than polished, idiomatic Go.

I'm also considering a 'pure Go' version: one goroutine per object, one to one. They're clocked. They communicate by channels. That would suit folk who want to model a system architecture by dropping their vision more or less directly into Go, rather than having to create weird data structures. Given such a thing, it would not be infeasible to generate the faster (well, hopefully) model automagically.
[Did this way back in the early '90s, using a homebrew language which introduced the idea of 'clocked channels'. Unfortunately I wasn't very good at writing compilers and the tool was… unstable]

— P

> On Jan 14, 2021, at 9:33 AM, David Riley <fraveyd...@gmail.com> wrote:
>
> On Jan 13, 2021, at 7:21 PM, Peter Wilson <peter.wil...@bsc.es> wrote:
>> So, after a long ramble, given that I am happy to waste CPU time in busy waits (rather than have the overhead of scheduling blocked goroutines), what is the recommendation for the signalling mechanism when all is done in Go and everything's a goroutine, not a thread?
>
> This is similar to something I'm working on for logic simulation, and I'd been thinking about the clocked simulation as well. I'll be interested in your results; since I'm also considering remote computation (and GPU computation, which might as well be remote) I'm currently going with the idea of futures driven by either channels or sync.Cond. That may not be as efficient for your use case.
>
>> My guess is that creating specialist blocking 'barriers' using sync/atomic (atomic operations seem to be around 4 ns on my Mac Mini) is the highest-performance mechanism. There's a dearth of performance information on channel communication, waitgroup, mutex, etc. use, but those figures I have seen suggest that sending/receiving on a channel might be on the order of 100 ns; since in C we iterate twice through the list in 30-40 ns, this is a tad high (yes, fixable by modeling a bigger system, but).
>
> My advice would be to implement the easiest method possible that's not likely to box you in, and profile it and see where your bottlenecks are. In my case, so far, the delays introduced by IPC mechanisms (and also allocations) are absolutely dwarfed by just the "business logic" crunching the logical primitives.
> So far it's not worth trying to improve the IPC on the order of nanoseconds (would be a nice problem to have) because the work done in each "chunk" is big enough that it's not worth worrying about.
>
> This also leads me to the next part, which is that if you have lots of little operations and you're worried about the time spent on IPC for each little thing, you'll probably get the easiest and best performance gains by trying to batch them so that you can burn through lots of similar operations at once before trying to send a slice over a channel or something.
>
> As always, do a POC implementation and then profile it. That's the only productive way to optimize things at this scale, and Go has EXCELLENT profiling capabilities built in.
>
> - Dave
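For what it's worth, the rough costs quoted above (atomic ops ~4 ns, channel send/receive ~100 ns) can be sanity-checked on one's own machine with Go's built-in testing.Benchmark, which runs fine from a plain program. A hedged sketch; function names are mine, absolute numbers vary by machine, and the buffered channel here measures an uncontended send+receive on a single goroutine, an optimistic lower bound compared with a real cross-goroutine handoff:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// atomicNsPerOp times a single atomic add.
func atomicNsPerOp() int64 {
	var n int64
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			atomic.AddInt64(&n, 1)
		}
	})
	return res.NsPerOp()
}

// chanNsPerOp times a buffered-channel send followed by a receive.
func chanNsPerOp() int64 {
	ch := make(chan int, 1) // buffered: send and receive stay on this goroutine
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			ch <- i
			<-ch
		}
	})
	return res.NsPerOp()
}

func main() {
	fmt.Println("atomic add:       ", atomicNsPerOp(), "ns/op")
	fmt.Println("chan send+receive:", chanNsPerOp(), "ns/op")
}
```

As Dave says, though, a profile of the real simulator (pprof is built in) is worth far more than microbenchmarks like these.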