David:

The first thing I want to say here is "thank you". I am grateful that you
keep pushing back when I have my head stuck in my ass, and not just on this
occasion. So thanks.


On Thu, Oct 17, 2013 at 9:52 PM, david j <[email protected]> wrote:

> On Thu, Oct 17, 2013 at 1:55 PM, Jonathan S. Shapiro <[email protected]> wrote:
>
>> Yes, I agree that it is a cost triggered by an in-band action (releasing
>> a reference). The problem is that the magnitude (if you prefer: cost
>> variance) of that cost is highly unpredictable. If you are releasing a leaf
>> object, it's a pretty low cost. If you are releasing the root of an object
>> graph, you could be pausing for a time that is bounded only by the size of
>> the heap. *Which makes it just as bad as halt-the-world GC*.
>>
>
> I strongly don't agree that it's as bad as halt-the-world GC.
>

... because (summarizing) the programmer *can* write an algorithm that
manages the deallocation order explicitly. Issues with cycles aside, I
agree. So I'll revise my statement to say two things:

   1. It's potentially just as bad if the programmer treats reference
   counting naively and doesn't take responsibility for deallocation policy
   on large object graphs.
   2. The cases where it is bad arise much less frequently than in marking
   algorithms:
      1. They are driven by explicit actions on the part of the mutator
      (deallocations), which can be deferred if the programmer takes
      suitable actions.
      2. Their scope is limited to object graphs that have previously been
      crafted by the programmer, which are not likely to have the same
      scale as the entire heap, and in any case the pile of poo involved is
      something the programmer can clearly understand.
      3. When the programmer chooses *not* to take direct control,
      propagation of deletes is something that is *very* amenable to
      incremental approaches, and to strategies for shifting the burden
      from the mutator to the collector (see the sketch below).

Can you live with this version, or does it warrant further refinement?
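
To make that last sub-point concrete, here is a minimal sketch (in C++,
with invented names like dec_buf; nothing here is a committed BitC design)
of how delete propagation can be made incremental: a count that hits zero
enqueues the object instead of recursively freeing it, and the cascade is
then drained a bounded number of entries at a time.

    #include <cstddef>
    #include <deque>
    #include <vector>

    struct Obj {
      std::size_t refcnt = 1;
      std::vector<Obj*> children;   // outgoing references
    };

    // Pending decrements; draining this is the deferred propagation of
    // deletes.
    static std::deque<Obj*> dec_buf;

    void dec_ref(Obj* o) {
      if (o && --o->refcnt == 0)
        dec_buf.push_back(o);       // defer the cascade instead of recursing
    }

    // Drain at most 'budget' entries. Children that hit zero enqueue more
    // work, so the pause per call stays bounded no matter how large the
    // released graph is.
    std::size_t drain_dec_buf(std::size_t budget) {
      std::size_t freed = 0;
      while (freed < budget && !dec_buf.empty()) {
        Obj* o = dec_buf.front();
        dec_buf.pop_front();
        for (Obj* c : o->children) dec_ref(c);
        delete o;
        ++freed;
      }
      return freed;
    }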

> The issue is whether it's *possible* to control it or not.
>

That's one issue, and obviously it *is* possible, as you pointed out. But
that's the limit case. Which is important, but the part that actually has
me more interested is the fine-grain incremental character of
reference-counted solutions. Programmers who don't *want* to understand
these issues - perhaps because their response time requirements are softer
and more informal - can very often achieve their requirements with tuning
knobs rather than taking over the details themselves. Yet the presence of
those tuning knobs need not thwart fine-grain control.

Just some dumb examples of options we could consider (a rough sketch of the
first two follows this list):

   1. Every global allocation must first process *N* entries on the dec-buf
   (incremental), N being tunable.
   2. We can provide an API through which mutators can voluntarily process
   *N* entries on the dec-buf.
   3. We can alter the size of the dec-buf.
   4. We can choose to mix tracing and reference counting strategies under
   programmer control using explicit allocation regions.
   5. We can trade immediacy of "available to realloc" for mutator delay on
   a relatively fine-grain basis.
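
As a rough sketch of how the first two options might look (hypothetical
names, building on the dec_buf sketch earlier in this message, not a
committed design):

    // Uses Obj, dec_buf, and drain_dec_buf from the earlier sketch.

    // Option 1: every allocation first pays down a bounded amount of
    // deferred deallocation work, so collection cost tracks mutator
    // activity. The budget is the tunable N.
    static std::size_t dec_budget_per_alloc = 8;

    void set_dec_budget(std::size_t n) { dec_budget_per_alloc = n; }

    Obj* rc_alloc() {
      drain_dec_buf(dec_budget_per_alloc);
      return new Obj();
    }

    // Option 2: a mutator with slack (say, while waiting on I/O) can
    // volunteer to process extra entries itself.
    std::size_t rc_collect_some(std::size_t n) {
      return drain_dec_buf(n);
    }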

And all of these choices can be tied back to intentional acts on the part
of the programmer and realized in relatively fine-grain increments. Which is
*very* hard to do in pure tracing systems.

What has me actually kind of excited here is the ability to place these
options in the hands of the programmer *without* sacrificing safety. I've
read a bunch of papers about RC at this point. They all talk about the
advantages of freeing things promptly, but then they introduce a bunch of
deferring optimizations that reduce that effect, and nobody seems to
measure the impact of that. But the flip side is that you have *options.*


>>> GC tracing is a hidden cost, because in between two instructions which
>>> don't allocate or deallocate you can lose a lot of time to the GC.
>>>
>>
>> That's not the case in *any* GC I know about. The rendezvous/syncpoints
>> are either explicit instructions or associated with allocations.
>>
>
> I'm not aware of any threaded GC system which provides any guarantees
> about where pauses occur WRT source-lines. JVM and CLR certainly do not.
> Any allocation in any thread can trigger a heap trace, which will stop all
> threads unpredictably in-between arbitrary source lines.
>

The term "pause" is complicated, because there is a qualitative difference
between a micro-pause between instructions (which could happen in any case
due to context switch) and a full-heap-mark kind of pause.  Collectors like
C4 have only micro-pauses. The collectors that don't allow simultaneous
mutator and collector operation of course have long halts, but those halts
can only occur at known points. In an academic sort of "strict letter of
the text" kind of way, those points could be said to be known to the
programmer. But I'm picking nits, and your point is well taken.

The one thing I would say here is that we need to be able to accept either
micro-pauses or read barriers (i.e. at least one of the two) if we are going
to adopt a relocating collector (again, bennie, I haven't read the URC paper
yet).
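
For what it's worth, the read-barrier option can be quite simple in
principle. Here is a purely illustrative Brooks-style indirection sketch
(not taken from the URC paper or from any committed design): every object
carries a forwarding pointer that normally points to itself, mutator loads
go through it, and relocation amounts to redirecting that one word.

    #include <atomic>

    struct GcObject {
      std::atomic<GcObject*> forward{this};  // self, or the relocated copy
      // ... payload fields ...
    };

    // Every mutator dereference goes through the barrier, so a relocation
    // becomes visible as soon as the forwarding pointer is updated.
    inline GcObject* read_barrier(GcObject* ref) {
      return ref ? ref->forward.load(std::memory_order_acquire) : nullptr;
    }

    // Collector side: copy the object, then redirect the old copy's
    // forwarding pointer so running mutators pick up the new location.
    inline void relocate(GcObject* from, GcObject* to) {
      to->forward.store(to, std::memory_order_release);
      from->forward.store(to, std::memory_order_release);
    }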



>>> How do you classify iOS and Python's reference counting on the spectrum of
>>> "don't make sense" vs "apply some care"? Both of them are used for many
>>> many very successful applications. (with quite low end-user perceived
>>> latencies)
>>>
>>
>> I'd classify both of them as sucking irredeemably. In both cases, other
>> factors were deemed to be MUCH more important than performance, and both
>> systems optimize for the criteria they consider important.
>>
>
> My only conclusion from this is that I don't share your definition of
> "performance", and I don't share your definition of "making sense".  :)
>

That's okay. I don't either, half the time. :-) And I sometimes suffer from
"making sense anxiety". :-)

Python is a scripting language. It's used for big things, but at the end of
the day you could quadruple the time spent by the Python portion of the
application and not notice. That isn't an argument that the RC scheme is
efficient. It's an argument that from a performance perspective the Python
code doesn't matter at all. When that's true, there's no reason to throw a
lot of engineering effort at faster RC implementations unless you can get
one off the shelf that was optimized for other reasons. That doesn't change
the fact that the RC scheme in Python suffers all of the issues of other
unoptimized RC schemes.

If you are asking "Should Guido and crowd put work into that?" the answer
is "no". But if you are asking "Should they maybe pick up a better one if
it's hanging off a nearby tree?" the answer is "yes". Not because Python
needs that right now, but because it increases the scope of problems for
which Python is appropriate and relevant, and because it lowers the heap
size requirement for Python, which really *is* a concern in the context of
python-in-apache.

> Languages have no value unto themselves. They are tools to make useful
> applications. If really awesome useful applications can be written, and
> meet important performance goals, then it's hard for me to say it's "sucking
> irredeemably". I find it a strange perspective to prioritize
> microbenchmarks like single-threaded mutator performance, rather than how
> the runtime facilitates overall application performance goals and SLAs.
>

So another way to frame that is that sometimes you can choose a technology
that you know sucks, but it's easy to do and it's good enough for the
problem you got. That's a good engineering call. That kind of (appropriate)
decision doesn't retroactively turn shit into gold. On the other hand, shit
turns out to be pretty useful stuff, so maybe it deserves more respect.

I just *know* I'm going to be quoted out of context on that.


Another issue to consider here is interop. Python is used in interaction
with other, larger systems. There has been a lot of push to arrive at a
single bytecode system for dynamically typed languages. Partly for
interop, partly to gather all the expertise on jointly beneficial work, and
partly because there is a sense that higher performance would increase the
"reach" of the language.



Anyway, *thank* you. It's not just that you push back on me when I'm stuck.
It's that in doing so you sometimes force me to re-see the problem in an
extremely useful way. This was one of those.



Jonathan
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
