On Jul 27, 2013, at 13:42 , Jonathan S. Shapiro <[email protected]> wrote:
> On Sat, Jul 27, 2013 at 11:11 AM, Ian P. Cooke <[email protected]> wrote:
>
>> On Jul 27, 2013, at 12:29, "Jonathan S. Shapiro" <[email protected]> wrote:
>>
>>> I'm sure that I'm leaving some things out, and I don't mean to be unfair by doing so. Here's how I would state the requirements for a systems language:
>>>
>>> 1. Is a great general-purpose language that
>>> 2. Supports prescriptive stack allocation
>>> 3. In which run-time expense of computing statements and expressions is understandable to the experienced programmer
>>> 4. In which that run-time cost understanding is reasonably balanced against compilation techniques like inlining and template expansion - C++ fails this test, not because it has these features, but because of the way in which these idioms are commonly used.
>>> 5. In which certain types of safe, prescriptive storage management are possible when algorithms are written by expert programmers.
>
>> * you may care what registers are used and what the contents are
>
> I tend to believe that if you care about this you should be using assembly language. Can you give a compelling example in which this should be true for high-level language code?
>
> With the sole exception of writing really low-level runtime code (like a GC implementation), I confess I've never seen a compelling case for this in modern compilers. The compiler is that much better at register allocation than you are.

I can't remember the case that caused me to want access to a register. I think it was either something in EFLAGS or the performance counters. Counting cache misses, maybe? In any case, 'easy in-line access to assembly language' is sufficient for me. C, C++, Rust, and ATS have it; Java and Python don't.

The other situation I ran into wasn't register-based but was the lack of an instruction being pulled up into an intrinsic for Java. I wanted popcnt for the implementation of Bagwell's HAMT but couldn't get at it efficiently, which completely negated the 10% performance improvement popcnt would buy me. Eventually Sun/Oracle implemented Long.bitCount() to use popcnt if it is available, but that optimization didn't exist at the time I needed it. With gcc I was able to use __builtin_popcountll.

http://semipublic.comp-arch.net/wiki/Population_count_(POPCNT)
http://lampwww.epfl.ch/papers/idealhashtrees.pdf
http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Other-Builtins.html
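(For reference, the trick in C. This is just a sketch, with names of my own rather than anything out of Bagwell's paper, but the builtin is real gcc:)

    #include <stdint.h>

    /* A HAMT node stores only its occupied children, packed densely,
     * plus a 64-bit map of which logical slots are present. The packed
     * index of logical slot 'slot' (0..63) is the number of set bits
     * below it: one popcount. */
    static inline unsigned hamt_index(uint64_t bitmap, unsigned slot)
    {
        /* gcc's __builtin_popcountll compiles to a single POPCNT
         * instruction when the target has it (e.g. -mpopcnt) and
         * falls back to a bit-twiddling routine otherwise */
        return (unsigned)__builtin_popcountll(bitmap & ((UINT64_C(1) << slot) - 1));
    }

Long.bitCount() is the identical computation; the only question was whether the JIT turned it into the instruction.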
>> * you care about the memory hierarchy and how long your cache lines are. you have explicit ways of avoiding false sharing.
>
> I agree that this can be very important, but I think it's beyond the scope of what a high-level language should try to solve in detail. I tend to believe that given known layout and sizing, the rest of this should remain in the hands of the programmer. I'm also concerned that this is a place where adding controls in the language may make support on certain platforms where the concepts don't exist (e.g. CLR) difficult or impossible.
>
> But this may be because I haven't thought about it adequately, and I've been able to solve my particular requirements in the past with only this much to go on. I can see some cases where a keyword like "cachealign" might be useful, for example.
>
> Is there anything specific that you would identify here as desirable?

Yeah, I was specifically thinking of an annotation that would put the memory for a variable in its own cache line. The Disruptor is a great example of Java whipped into use as a systems programming language without going native. One of the things they do is pad an index that's written by one thread and read by many, so as to keep it in its own cache line and avoid any cache-coherency chatter between cores.

http://lmax-exchange.github.io/disruptor/

see the use of VolatileLong here:
http://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html

and PaddedLong here:
https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/util/PaddedLong.java

note: I can't find the reference, but the implementation had to change at some point because the trick stopped working between two JVM releases. Had there been some kind of official @CacheLinePadded annotation you wouldn't run into that sort of thing.
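(In C11 terms the annotation amounts to something like the following. A sketch: the 64-byte line size is an assumption, typical of current x86, and the names are mine:)

    #include <stdalign.h>
    #include <stdatomic.h>

    #define CACHE_LINE 64   /* assumed line size */

    /* the PaddedLong idea: alignas starts the hot counter on its own
     * cache line, and the explicit padding fills the rest of the line,
     * so no unrelated variable can share it and the producer core's
     * writes never invalidate anyone else's data */
    struct padded_counter {
        alignas(CACHE_LINE) _Atomic long value;
        char pad[CACHE_LINE - sizeof(_Atomic long)];
    };

(JEP 142's @Contended proposal is headed in exactly this direction, for what it's worth.)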
>> * you care about swapping, usually to the point that if your application is swapped out, horrible things happen
>
> No. I agree that swapping is important, but it occurs at a completely different level of abstraction than a systems programming language can reasonably address. I think the closest you can get to this in practice is to say that you sometimes require the ability to tightly bound the heap footprint of your program.
>
> If this is a general requirement for all programs, you have bigger problems than a programming language and runtime can solve for you. iOS comes to mind as an example of a design that suffers bad problems in this regard.

I usually think of memory allocation as a language-level thing, but you're right that I'm thinking of an OS-level optimization. For example, how would you design things such that you could still malloc all the memory you needed at startup and then mlockall and pre-fault it? I guess I said 'swapping' when I meant 'faulting'; I try to avoid both. (A sketch of the pattern follows the links below.)

http://pubs.opengroup.org/onlinepubs/007908799/xsh/mlockall.html

see comments/code re: mlockall:
https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application
https://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO
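(The startup sequence I mean, roughly. A Linux/POSIX sketch: sizes are illustrative, error handling is elided, and mlockall generally needs CAP_IPC_LOCK or a raised memlock rlimit:)

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* lock all current and future pages into RAM: no swapping */
        mlockall(MCL_CURRENT | MCL_FUTURE);

        /* malloc everything the application will ever need, up front */
        size_t arena_size = 256 * 1024 * 1024;
        char *arena = malloc(arena_size);

        /* pre-fault: touching one byte per page makes the kernel back
         * every page now rather than on first touch in the hot path */
        long page = sysconf(_SC_PAGESIZE);
        for (size_t i = 0; i < arena_size; i += (size_t)page)
            arena[i] = 0;

        /* enter the event loop; no faults on this memory from here on */
        return 0;
    }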
>> * you care about how language-level threads are mapped to cores and how they are scheduled (priorities, context-switches, etc)
>
> That's important, but it's not a language-level issue. The language needs to provide adequate support for concurrency, and the environment library needs to get that mapped to OS threads and cores, but the latter isn't directly a language design issue.
>
> I do agree that there are ways to booger the language design so as to make this impossible, and those are good things to avoid.

It is a runtime issue, though, right? At startup of the runtime, or first thing in main(), I want to be able to say 'pin this hardware thread, and this hardware thread only, to this core, and run this task's runloop on that thread' so that all of its sub-tasks run there. That way I keep those tasks from switching contexts unless I want them to. Priorities I can handle at the OS level so long as I can easily identify the tid. I want to be able to talk to whatever's scheduling tasks to get them mapped out how I like.
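(With today's pthreads the shape of it is roughly this. A Linux-specific sketch: _GNU_SOURCE is needed for the affinity call, the core number is arbitrary, and run_task_loop is a stand-in for the runtime's runloop:)

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void run_task_loop(void) { /* runtime's runloop would live here */ }

    /* restrict the calling thread to exactly one CPU so the tasks it
     * runs never migrate off it (keeping *other* work off that core is
     * a separate job, e.g. isolcpus) */
    static void pin_self_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        pin_self_to_core(2);   /* first thing in main(), as above */
        run_task_loop();       /* every sub-task now stays on core 2 */
        return 0;
    }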
>> I think that it's your number 3 that causes people to avoid garbage collection (at least it does to me). Even if you understand the cost of the kinds of collection, you usually don't know when it's going to happen.
>
> I've been hearing that argument for 35 years now. During those 35 years, I've heard only two or three design pattern scenarios in which (assuming a properly implemented GC) the programmer had any business caring about this.

For those patterns, do you consider the solution to be "if you care, then don't allocate any more memory"? In that case I definitely want a NoAlloc effect :)

>> Usually there are critical sections where you must not be preempted by either the garbage collection runtime or the kernel.
>
> If true, this is an important issue, but I'm not convinced that it's true. I understand how critical sections work, but in all of the use-cases I've seen, non-preemption by the kernel isn't the right answer.
>
> Non-preemption by GC is relevant when you have a hard time bound. And even there, strictly speaking, a sufficiently small preemption is OK. The place where this really becomes a problem is when GC can be triggered by a second-party thread that isn't in a critical section, and can therefore impact the behavior of the thread in the critical section. This kind of thing can't really be solved without more sophisticated types of heap isolation and garbage collection than are currently widespread.
>
> I do have some approaches to dealing with this running around in my head. Talking about real-world use cases would be very helpful here.

If your event loop completes in single-digit microseconds, can you make a sufficiently small preemption small enough? Granted, at that point we're looking at specialized hardware to get the network stack and maybe the rest of the kernel out of the way, but I was speculating on how much you can eke out of a language/runtime/processor without such an expense.

Tick-to-trade latency:
https://en.wikipedia.org/wiki/High-frequency_trading
http://www.redlinetrading.com/pdf/Intel%20Redline%20STAC-T1_TechBrief_2013.pdf
http://www.businesswire.com/news/home/20121029005168/en/Redline-Trading-Solutions-Achieves-6.1-Microsecond-Tick-to-Trade

>> If there's a way to disable the GC for the duration of those sections (possibly the lifetime of the application) then great. Or, I suppose, if you have the hardware resources, you can dedicate one or more cores to GC and say, 'run over there and never take actions that will interrupt my application threads running on these other cores'; that would be acceptable.
>>
>> Is that what you meant by number 5? Assuming the presence of an expert programmer, can they write a complex application that allocates all the memory it needs on startup and then run with a guarantee that the GC will not cost a single cycle? I think it falls under that 'pay for what you use' philosophy. If I don't need a collector, then any time spent running one, even if it doesn't interrupt my program, is time wasted that could be used by other processes.
>
> I'm certainly thinking about mechanisms that could achieve what you are describing in suitably careful codes, and that would make checking the careful codes for compliance possible. We coded both EROS and Coyotos this way, for example, and it's well within what is feasible and reasonable to type the fact that the program does not allocate after startup.
>
> Though I'd personally be willing to accept an approach that required a compacting GC to be run before the non-allocating "main loop" began.
>
> The "pay for what you use" philosophy is problematic. While I understand the appeal, in practice it too often turns into "everyone pays for your ability to not use a few things".

I can see that; certainly applications this sensitive are in the minority. However, this minority is currently relegated to a very few intense guru-level Java programmers and mostly C++ guys. I figure that's a good target for a new systems programming language. I know some Java guys who can do it, so I'm sure it won't be impossible in whatever scheme you come up with, but how much of a burden on the developer will it be? If it's an easy sell to these guys, you can definitely gain traction with everyone else.

-ipc
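P.S. For concreteness, the allocate-everything-at-startup discipline in C. A sketch, with a pool of my own invention; the point is that a NoAlloc effect would let the compiler check the main-loop half of this contract:

    #include <stddef.h>
    #include <stdlib.h>

    static char  *pool;
    static size_t pool_size, pool_used;

    /* startup-only: carve the next piece out of the one big arena */
    static void *startup_alloc(size_t n)
    {
        void *p = pool + pool_used;
        pool_used += (n + 15) & ~(size_t)15;   /* keep 16-byte alignment */
        return pool_used <= pool_size ? p : NULL;
    }

    int main(void)
    {
        pool_size = 64 * 1024 * 1024;          /* sized for illustration */
        pool = malloc(pool_size);

        double *ring = startup_alloc(4096 * sizeof *ring);
        (void)ring;

        /* main loop: nothing below this line allocates, so a collector,
         * if one existed, would never have a reason to run */
        for (;;) { /* ... */ break; }
        return 0;
    }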
