Hi Shawn,

On Sun, Jul 15, 2012 at 12:00 PM, Shawn Morel <shawnmo...@me.com> wrote:
>
>> The runtime also is designed to minimize L1 cache misses, more on that
>> if there is interest.
>
> I would be interested in some of the details.
I'll write more about this in a separate email.

>> Regarding the Nile C runtime, inter-thread communication is currently
>> not a bottleneck.
>
> what are the current Nile bottlenecks?

It depends on the Nile program. For several graphics pipelines, the SortBy stage (a runtime-supplied Nile process) often takes a good part of the time, because rasterization often involves sorting many stream elements, twice. I have a handful of ideas for optimizing SortBy that I just haven't implemented yet. The important point, though, is that the performance profiles most often show time being spent on the essential computations of the Nile program, rather than on incidental runtime work.

>> Queuing is a relatively infrequent
>> occurrence (compared to computation and stream data read/writing),
>
> How is that so, I would have assumed that joins in the stream processor
> effectively become a reader writer problem.

I'm not sure I understand the question, but I'll try to answer anyway.

The Nile model is about stream processing (rather than, say, fine-grained/reactive dataflow), so data is processed in batches (i.e., data is buffered). Queuing/dequeuing stream data is done at batch granularity, rather than at individual-element granularity. Thus, queuing is a relatively infrequent occurrence compared to regular computation and individual stream element reading/writing.

Regarding joins in the process network: a single process performs the work of combining the two incoming streams. So we have basically a two-writer-one-reader scenario, but the two writers do not share a queue. The joining process (the reader) pulls from two queues and produces output on a single queue.

It might help to know that joins in Nile are limited to zipping (combining input elements from the two streams in an alternating fashion) and concatenating (appending the entire input stream of one branch to the entire stream of the other).
There is no "first element to arrive on either stream" behavior, as there is in reactive/event systems.

>> plus the queues are kept on a per-process basis (there are many more
>> Nile processes than OS-level threads running) which scales well.
>
> Could this arbitrary distinction between process scheduling, OS threads and
> app threads (greenlets, nile process etc) be completely removed with a
> pluggable hierarchical scheduler? For example, at the top-level you might
> have a completely fair scheduler (4 processes = 1/4 of the time to your
> process assuming you can make use of it). Within that, it's up to you to
> divvy-up time. I'm visualizing this kind of how the Mach kernel had external
> memory pagers that you could plug in if ever you had better page eviction
> models for your domain.
>
> Then there's obviously how that interacts with the HW model for thread /
> process switching and memory barriers but that seems like a separate problem.

Again, I'm not sure I follow you here, but here goes:

The Nile runtime (the multithreaded, C-based one) uses OS threads only to get access to multiple cores, not because I want the OS scheduler to do anything for me regarding load balancing or Nile process scheduling. Ideally, the number of OS threads used in the runtime equals (nearly) the number of cores (or virtual cores, if SMT is present), and the OS will assign each OS thread to a separate (virtual) core. The main scheduling and load balancing is done by the runtime at the Nile process level (as with "green" threads). Scheduling is very specific to the Nile computational model, so I don't see how a pluggable scheduler would help.

Sorry for the late response -- was on vacation.

Dan

_______________________________________________
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc