Hi Shawn,

On Sun, Jul 15, 2012 at 12:00 PM, Shawn Morel <shawnmo...@me.com> wrote:
>
>> The runtime also is designed to minimize L1 cache misses, more on that
>> if there is interest.
>
> I would be interested in some of the details.

I'll write more about this in a separate email.

>> Regarding the Nile C runtime, inter-thread communication is currently
>> not a bottleneck.
>
> what are the current Nile bottlenecks?

It depends on the Nile program. For several graphics pipelines, the
SortBy stage (a runtime-supplied Nile process) often takes a good part
of the time. This is because rasterization often involves sorting many
stream elements, twice. I have a handful of ideas for optimizing
SortBy that I just haven't implemented yet.

The important point, though, is that performance profiles most often
show time being spent on the essential computations of the Nile
program rather than on incidental runtime work.

>>  Queuing is a relatively infrequent
>> occurrence (compared to computation and stream data read/writing),
>
> How is that so? I would have assumed that joins in the stream processor 
> effectively become a reader-writer problem.

I'm not sure I understand the question, but I'll try to answer anyway.

The Nile model is about stream processing (rather than say,
fine-grained/reactive dataflow). So the data is processed in batches
(i.e., data is buffered). Queuing/dequeuing stream data is done at
batch granularity, rather than individual data element granularity.
Thus, queuing is a relatively infrequent occurrence compared to
regular computation and individual stream element reading/writing.

Regarding joins in the process network, a single process performs the
work of combining the two incoming streams. So we basically have a
two-writer, one-reader scenario, but the two writers do not share a
queue: the joining process (the reader) pulls from two queues and
produces output on a single queue.

It might help to know that joins in Nile are limited to zipping
(combining input elements from the two streams in an alternating
fashion) and concatenating (appending the entire input stream of one
branch to the entire stream of the other). There is no "first element
to arrive on either stream" like in reactive/event systems.

>>  plus the queues are kept on a per-process basis (there are many more
>> Nile processes than OS-level threads running) which scales well.
>
>
> Could this arbitrary distinction between process scheduling, OS threads and 
> app threads (greenlets, nile process etc) be completely removed with a 
> pluggable hierarchical scheduler? For example, at the top-level you might 
> have a completely fair scheduler (4 processes = 1/4 of the time to your 
> process assuming you can make use of it). Within that, it's up to you to 
> divvy-up time. I'm visualizing this kind of how the Mach kernel had external 
> memory pagers that you could plug in if ever you had better page eviction 
> models for your domain.
>
> Then there's obviously how that interacts with the HW model for thread / 
> process switching and memory barriers but that seems like a separate problem.

Again, I'm not sure I follow you here. But here goes:

The Nile runtime (the multithreaded C-based one) uses OS threads only
to get access to multiple cores, not because I want to use the OS
scheduler to do anything for me regarding load balancing or Nile
process scheduling. Ideally, the number of OS threads used in the
runtime equals (nearly) the number of cores (or virtual cores if SMT
is present), and the OS will assign each OS thread to a separate
(virtual) core.

The main scheduling and load balancing is done by the runtime at the
Nile process level (as in "green" threads). Scheduling is very
specific to the Nile computational model, so I don't see how having a
pluggable scheduler might help.

Sorry for the late response -- was on vacation.

Dan
_______________________________________________
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc
